Abstract 

We study a general class of online learning problems where the feedback is specified 
by a graph. This class includes online prediction with expert advice and the multi-armed 
bandit problem, but also several learning problems where the online player 
does not necessarily observe his own loss. We analyze how the structure of the feedback 
graph controls the inherent difficulty of the induced T-round learning problem. 
Specifically, we show that any feedback graph belongs to one of three classes: strongly 
observable graphs, weakly observable graphs, and unobservable graphs. We prove that 
the first class induces learning problems with $\tilde{\Theta}(\alpha^{1/2} T^{1/2})$ minimax regret, where $\alpha$ is 
the independence number of the underlying graph; the second class induces problems 
with $\tilde{\Theta}(\delta^{1/3} T^{2/3})$ minimax regret, where $\delta$ is the domination number of a certain 
portion of the graph; and the third class induces problems with linear minimax regret. 
Our results subsume much of the previous work on learning with feedback graphs and 
reveal new connections to partial monitoring games. We also show how the regret is 
affected if the graphs are allowed to vary with time. 


*Tel Aviv University, Tel Aviv, Israel, and Microsoft Research, Herzliya, Israel, nogaa@post.tau.ac.il. 

^Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy, 
nicolo.cesa-bianchi@unimi.it. Parts of this work were done while the author was at Microsoft Research, Redmond. 

^Microsoft Research, Redmond, Washington; oferd@microsoft.com. 

^Technion—Israel Institute of Technology, Haifa, Israel, and Microsoft Research, Herzliya, Israel, 
tomerk@technion.ac.il. Parts of this work were done while the author was at Microsoft Research, Redmond. 





1 Introduction 


Online learning can be formulated as a repeated game between a randomized player and 
an arbitrary, possibly adversarial, environment (see, e.g., Cesa-Bianchi and Lugosi, 2006; 
Shalev-Shwartz, 2011). We focus on the version of the game where, on each round, the 
player chooses one of K actions and incurs a corresponding loss. The loss associated with each 
action on each round is a number between 0 and 1, assigned in advance by the environment. 
The player’s performance is measured using the game-theoretic notion of regret, which is 
the difference between his cumulative loss and the cumulative loss of the best fixed action 
in hindsight. We say that the player is learning if his regret after T rounds is o(T). 

After choosing an action, the player observes some feedback, which enables him to learn 
and improve his choices on subsequent rounds. A variety of different feedback models are 
discussed in online learning. The most common is full feedback, where the player gets to 
see the loss of all the actions at the end of each round. This feedback model is often 
called prediction with expert advice (Cesa-Bianchi et al., 1997; Littlestone and Warmuth, 
1994; Vovk, 1990). For example, imagine a single-minded stock market investor who invests 
all of his wealth in one of K stocks on each day. At the end of the day, the investor incurs 
the loss associated with the stock he chose, but he also observes the loss of all the other 
stocks. 

Another common feedback model is bandit feedback (Auer et al., 2002), where the player 
only observes the loss of the action that he chose. In this model, the player’s choices influence 
the feedback that he receives, so he has to balance an exploration-exploitation trade-off. On 
one hand, the player wants to exploit what he has learned from the previous rounds by 
choosing an action that is expected to have a small loss; on the other hand, he wants 
to explore by choosing an action that will give him the most informative feedback. The 
canonical example of online learning with bandit feedback is online advertising. Say that 
we operate an Internet website and we present one of K ads to each user that views the 
site. Our goal is to maximize the number of clicked ads and therefore we incur a unit loss 
whenever a user doesn’t click on an ad. We know whether or not the user clicked on the ad 
we presented, but we don’t know whether he would have clicked on any of the other ads. 

Full feedback and bandit feedback are special cases of a general framework introduced 
by Mannor and Shamir (2011), where the feedback model is specified by a feedback graph. 
A feedback graph is a directed graph whose nodes correspond to the player’s K actions. A 
directed edge from action i to action j (when i = j this edge is called a self-loop) indicates 
that whenever the player chooses action i he gets to observe the loss associated with action 
j. The full feedback model is obtained by setting the feedback graph to be the directed 
clique (including all self-loops, see Fig. 1a). The bandit feedback model is obtained by the 
graph that only includes the self-loops (see Fig. 1b). Feedback graphs can describe many 
other interesting online learning scenarios, as discussed below. 
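As a minimal sketch of this framework, a feedback graph can be stored as a mapping from each action to the set of actions whose losses it reveals. The dictionary representation, the function names, and the 0-based action labels are illustrative assumptions, not notation from the paper; the self-loop on the revealing action is also an assumption made for concreteness.

```python
# Assumption: a feedback graph is a dict {action: set of actions observed when playing it}.

def full_feedback(k):
    """Directed clique with all self-loops: every action reveals every loss."""
    return {i: set(range(k)) for i in range(k)}

def bandit_feedback(k):
    """Only self-loops: each action reveals its own loss."""
    return {i: {i} for i in range(k)}

def loopless_clique(k):
    """Directed clique without self-loops: each action reveals all losses but its own."""
    return {i: set(range(k)) - {i} for i in range(k)}

def apple_tasting():
    """Action 0 (discard, i.e. taste) reveals both losses; action 1 (ship) reveals nothing."""
    return {0: {0, 1}, 1: set()}

def revealing_action(k):
    """Action 0 reveals every loss (self-loop assumed); all other actions reveal nothing."""
    graph = {i: set() for i in range(k)}
    graph[0] = set(range(k))
    return graph

def observe(graph, losses, action):
    """Feedback received after playing `action`: the losses of its out-neighbors."""
    return {j: losses[j] for j in graph[action]}
```

For example, `observe(loopless_clique(3), [0.1, 0.2, 0.3], 1)` returns the losses of actions 0 and 2 but not the loss of the chosen action 1.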

Our main goal is to understand how the structure of the feedback graph controls the 
inherent difficulty of the induced online learning problem. While regret measures the 
performance of a specific player or algorithm, the inherent difficulty of the game itself is measured 
by the minimax regret, which is the regret incurred by an optimal player that plays against 
the worst-case environment. Freund and Schapire (1997) prove that the minimax regret of 
the full feedback game is $\Theta(\sqrt{T \ln K})$, while Auer et al. (2002) prove that the minimax 
regret of the bandit feedback game is $\Theta(\sqrt{KT})$. Both of these settings correspond to feedback 
graphs where all of the vertices have self-loops; we say that the player in these settings is 
self-aware: he observes his own loss value on each round. The minimax regret rates induced 
by self-aware feedback graphs were extensively studied in Alon et al. (2014). In this paper, 
we focus on the intriguing situation that occurs when the feedback graph is missing some 
self-loops, namely, when the player does not always observe his own loss. He is still 
accountable for the loss on each round, but he does not always know how much loss he incurred. 
As revealed by our analysis, the absence of self-loops can have a significant impact on the 
minimax regret of the induced game. 

An example of a concrete setting where the player is not always self-aware is the apple 
tasting problem (Helmbold et al., 2000). In this problem, the player examines a sequence 
of apples, some of which may be rotten. For each apple, he has two possible actions: he 
can either discard the apple (action 1) or he can ship the apple to the market (action 2). 
The player incurs a unit loss whenever he discards a good apple and whenever he sends 
a rotten apple to the market. However, the feedback is asymmetric: whenever the player 
chooses to discard an apple, he first tastes the apple and obtains full feedback; on the other 
hand, whenever he chooses to send the apple to the market, he doesn’t taste it and receives 
no feedback at all. The feedback graph that describes the apple tasting problem is shown 
in Fig. 1d. Another problem that is closely related to apple tasting is the revealing action 
or label efficient problem (Cesa-Bianchi and Lugosi, 2006, Example 6.4). In this problem, 
one action is a special action, called the revealing action, which incurs a constant unit loss. 
Whenever the player chooses the revealing action, he receives full feedback. Whenever the 
player chooses any other action, he observes no feedback at all (see Fig. 1e). 

Yet another interesting example where the player is not self-aware is obtained by setting 
the feedback graph to be the loopless clique (the directed clique minus the self-loops, see 
Fig. 1c). This problem is the complement to the bandit problem: when the player chooses 
an action, he observes the loss of all the other actions, but he does not observe his own 
loss. To motivate this, imagine a police officer who wants to prevent crime. On each day, 
the officer chooses to stand in one of K possible locations. Criminals then show up at some 
of these locations: if a criminal sees the officer, he runs away before being noticed and the 
crime is prevented; otherwise, he goes ahead with the crime. The officer gets a unit reward 
for each crime he prevents,^ and at the end of each day he receives a report of all the crimes 
that occurred that day. By construction, the officer does not know if his presence prevented 
a planned crime, or if no crime was planned for that location. In other words, the officer 
observes everything but his own reward. 

Our main result is a full characterization of the minimax regret of online learning problems 
defined by feedback graphs. Specifically, we categorize the set of all feedback graphs into 
three distinct sets. The first is the set of strongly observable feedback graphs, which induce 
online learning problems whose minimax regret is $\tilde{\Theta}(\alpha^{1/2} T^{1/2})$, where $\alpha$ is the independence 
number of the feedback graph. This slow-growing minimax regret rate implies that the 
problems in this category are easy to learn. The set of strongly observable feedback graphs 
includes the set of self-aware graphs, so this result extends the characterization given in 
Alon et al. (2014). The second category is the set of weakly observable feedback graphs, 
which induce learning problems whose minimax regret is $\tilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\delta$ is a new 
graph-dependent quantity called the weak domination number of the feedback graph. The 
minimax regret of these problems grows at a faster rate of $T^{2/3}$ with the number of rounds, 
which implies that the induced problems are hard to learn. The third category is the set of 
unobservable graphs, which induce unlearnable $\Theta(T)$ online problems. 

^It is easier to describe this example in terms of maximizing rewards, rather than minimizing losses. In 
our formulation of the problem, a reward of r is mathematically equivalent to a loss of 1 − r. 

Our characterization bears some surprising implications. For example, the minimax 
regret for the loopless clique is the same, up to constant factors, as the $\Theta(\sqrt{T \ln K})$ minimax 
regret for the full feedback graph. However, if we start with the full feedback graph (the 
directed clique with self-loops) and remove a self-loop and an incoming edge from any node 
(see Fig. 1f), we are left with a weakly observable feedback graph, and the minimax 
regret jumps to order $T^{2/3}$. Another interesting property of our characterization is how the 
two learnable categories of feedback graphs depend on completely different graph-theoretic 
quantities: the independence number $\alpha$ and the weak domination number $\delta$. 

The setting of online learning with feedback graphs is closely related to the more 
general setting of partial monitoring (see, e.g., Cesa-Bianchi and Lugosi, 2006, Section 6.4), 
where the player’s feedback is specified by a feedback matrix, rather than a feedback graph. 
Partial monitoring games have also been categorized into three classes: easy problems 
with $\tilde{\Theta}(\sqrt{T})$ regret, hard problems with $\Theta(T^{2/3})$ regret, and unlearnable problems with linear 
regret (Bartók et al., 2014, Theorem 2). If the loss values are chosen from a finite set (say 
{0,1}), then bandit feedback, apple tasting feedback, and the revealing action feedback 
models are all known to be special cases of partial monitoring. In fact, in Appendix D we 
show that any problem in our setting (with binary losses) can be reduced to the partial 
monitoring setting. Nevertheless, the characterization presented in this paper has several 
clear advantages over the more general characterization of partial monitoring games. First, 
our regret bounds are minimax optimal not only with respect to T, but also with respect to 
the other relevant problem parameters. Second, we obtain our upper bounds with a simple 
and efficient algorithm. Third, our characterization is stated in terms of simple and intuitive 
combinatorial properties of the problem. 

The paper is organized as follows. In Section 2 we define the problem setting and state 
our main results. In Section 3 we describe our player algorithm and prove upper bounds on 
the minimax regret. In Section 4 we prove matching lower bounds on the minimax regret. 
Finally, in Section 5 we extend our analysis to the case where the feedback graph is neither 
fixed nor known in advance. 

Figure 1: Examples of feedback graphs: (a) full feedback, (b) bandit feedback, (c) loopless 
clique, (d) apple tasting, (e) revealing action, (f) a clique minus a self-loop and another edge. 

2 Problem Setting and Main Results 

Let $G = (V, E)$ be a directed feedback graph over the set of actions $V = \{1, \dots, K\}$. For 
each $i \in V$, let $N^{\mathrm{in}}(i) = \{j \in V : (j, i) \in E\}$ be the in-neighborhood of $i$ in $G$, and let 
$N^{\mathrm{out}}(i) = \{j \in V : (i, j) \in E\}$ be the out-neighborhood of $i$ in $G$. If $i$ has a self-loop, that 
is $(i, i) \in E$, then $i \in N^{\mathrm{in}}(i)$ and $i \in N^{\mathrm{out}}(i)$. 

Before the game begins, the environment privately selects a sequence of loss functions 
$\ell_1, \ell_2, \dots$, where $\ell_t : V \to [0, 1]$ for each $t \ge 1$. On each round $t = 1, 2, \dots$, the player 
randomly chooses an action $I_t \in V$ and incurs the loss $\ell_t(I_t)$. At the end of round $t$, 
the player receives the feedback $\{(j, \ell_t(j)) : j \in N^{\mathrm{out}}(I_t)\}$. In words, the player observes 
the loss associated with each vertex in the out-neighborhood of the chosen action $I_t$. In 
particular, if $I_t$ has no self-loop, then the player’s loss $\ell_t(I_t)$ remains unknown, and if the 
out-neighborhood of $I_t$ is empty, then the player does not observe any feedback on that 
round. The player’s expected regret against a specific loss sequence $\ell_1, \dots, \ell_T$ is defined as 
$\mathbb{E}\big[\sum_{t=1}^{T} \ell_t(I_t)\big] - \min_{i \in V} \sum_{t=1}^{T} \ell_t(i)$. The inherent difficulty of the T-round online learning 
problem induced by the feedback graph $G$ is measured by the minimax regret, denoted by 
$R(G, T)$ and defined as the minimum over all randomized player strategies, of the maximum 
over all loss sequences, of the player’s expected regret. 
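The regret of one realized sequence of plays can be computed directly from this definition. The sketch below is illustrative only: the function name and the list-of-lists layout of the losses are assumptions, and it evaluates a single realization, whereas the expected regret defined above additionally averages over the player's internal randomization.

```python
def realized_regret(losses, actions):
    """Regret of one realized play sequence against the best fixed action.

    losses[t][i] is the loss of action i on round t (assumed layout);
    actions[t] is the action the player chose on round t.
    """
    num_rounds = len(losses)
    num_actions = len(losses[0])
    # Cumulative loss actually incurred by the player.
    player_loss = sum(losses[t][actions[t]] for t in range(num_rounds))
    # Cumulative loss of the best single action in hindsight.
    best_fixed = min(
        sum(losses[t][i] for t in range(num_rounds)) for i in range(num_actions)
    )
    return player_loss - best_fixed
```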
2.1 Main Results 


The main result of this paper is a complete characterization of the minimax regret when the 
feedback graph G is hxed and known to the player. Our characterization relies on various 
properties of G, which we dehne below. 

Definition (Observability). In a directed graph $G = (V, E)$ a vertex $i \in V$ is observable 
if $N^{\mathrm{in}}(i) \ne \emptyset$. A vertex is strongly observable if either $\{i\} \subseteq N^{\mathrm{in}}(i)$, or $V \setminus \{i\} \subseteq N^{\mathrm{in}}(i)$, 
or both. A vertex is weakly observable if it is observable but not strongly. A graph $G$ is 
observable if all its vertices are observable and it is strongly observable if all its vertices are 
strongly observable. A graph is weakly observable if it is observable but not strongly. 

In words, a vertex is observable if it has at least one incoming edge (possibly a 
self-loop), and it is strongly observable if it has either a self-loop or incoming edges from all 
other vertices. Note that a graph with all of the self-loops is necessarily strongly observable. 
However, a graph that is missing some of its self-loops may or may not be observable or 
strongly observable. 
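The observability definition translates directly into a classification routine. The following is a minimal sketch under the assumption that the graph is given as a dictionary mapping each action to the set of actions it observes (its out-neighborhood); the function names are hypothetical.

```python
def in_neighbors(graph, i):
    """Vertices j with an edge (j, i), i.e. actions whose play reveals the loss of i."""
    return {j for j, outs in graph.items() if i in outs}

def classify(graph):
    """Return 'strongly observable', 'weakly observable' or 'unobservable'."""
    vertices = set(graph)
    all_strong = True
    for i in vertices:
        n_in = in_neighbors(graph, i)
        if not n_in:
            # A vertex with no incoming edge makes the whole graph unobservable.
            return "unobservable"
        has_self_loop = i in n_in
        seen_by_all_others = (vertices - {i}) <= n_in
        if not (has_self_loop or seen_by_all_others):
            all_strong = False
    return "strongly observable" if all_strong else "weakly observable"
```

Under this sketch the two-action apple tasting graph is classified as strongly observable (the unobserved action is seen by the only other action), while a revealing-action graph with three or more actions is only weakly observable.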

Definition (Weak Domination). In a directed graph $G = (V, E)$ with a set of weakly 
observable vertices $W \subseteq V$, a weakly dominating set $D \subseteq V$ is a set of vertices that dominates 
$W$. Namely, for any $w \in W$ there exists $d \in D$ such that $w \in N^{\mathrm{out}}(d)$. The weak domination 
number of $G$, denoted by $\delta(G)$, is the size of the smallest weakly dominating set. 

Our characterization also relies on a more standard graph-theoretic quantity. An 
independent set $S \subseteq V$ is a set of vertices that are not connected by any edges. Namely, for any 
$u, v \in S$, $u \ne v$, it holds that $(u, v) \notin E$. The independence number $\alpha(G)$ of $G$ is the size of 
its largest independent set. Our characterization of the minimax regret rates is given by the 
following theorem. 

Theorem 1. Let $G = (V, E)$ be a feedback graph with $|V| \ge 2$, fixed and known in advance. 
Let $\alpha = \alpha(G)$ denote its independence number and let $\delta = \delta(G)$ denote its weak domination 
number. Then the minimax regret of the T-round online learning problem induced by $G$, 
where $T \ge |V|^3$, is 

(i) $R(G, T) = \tilde{\Theta}(\alpha^{1/2} T^{1/2})$ if $G$ is strongly observable; 

(ii) $R(G, T) = \tilde{\Theta}(\delta^{1/3} T^{2/3})$ if $G$ is weakly observable; 

(iii) $R(G, T) = \Theta(T)$ if $G$ is not observable. 
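Both graph parameters in the theorem can be computed by exhaustive search on small graphs. The sketch below is illustrative and exponential-time; it assumes the graph is a dictionary mapping each action to the set of actions it observes, and the function names are hypothetical. The example graph is a star with self-loops in which one leaf has lost its self-loop.

```python
from itertools import combinations

def independence_number(graph):
    """Size of the largest set with no edge (in either direction) between distinct vertices."""
    vertices = sorted(graph)

    def independent(subset):
        return all(v not in graph[u] and u not in graph[v]
                   for u, v in combinations(subset, 2))

    return max(r for r in range(len(vertices) + 1)
               if any(independent(s) for s in combinations(vertices, r)))

def weak_domination_number(graph):
    """Size of the smallest set D such that every weakly observable vertex has an in-neighbor in D."""
    vertices = set(graph)
    n_in = {i: {j for j in vertices if i in graph[j]} for i in vertices}
    # Weakly observable: observable, no self-loop, and not seen by all other vertices.
    weak = {i for i in vertices
            if n_in[i] and i not in n_in[i] and not (vertices - {i}) <= n_in[i]}
    for r in range(len(vertices) + 1):
        for d in combinations(sorted(vertices), r):
            if all(n_in[w] & set(d) for w in weak):
                return r

# Star centered at 0 with self-loops on all vertices except leaf 4.
star_minus_loop = {0: {0, 1, 2, 3, 4}, 1: {1}, 2: {2}, 3: {3}, 4: set()}
alpha = independence_number(star_minus_loop)     # the four leaves are pairwise non-adjacent
delta = weak_domination_number(star_minus_loop)  # the center alone dominates leaf 4
```

In this example the independence number equals K − 1 = 4 while the weak domination number is 1, which is why the two regimes of the theorem can scale so differently with the number of actions.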

As mentioned above, this characterization has some interesting consequences. Any 
strongly observable graph can be turned into a weakly observable graph by removing at 
most two edges. Doing so will cause the minimax regret rate to jump from order $\sqrt{T}$ to 
order $T^{2/3}$. Even more remarkably, removing these edges will cause the minimax regret to 
switch from depending on the independence number to depending on the weak domination 
number. A striking example of this abrupt change is the loopy star graph, which is the union 
of the directed star (Fig. 1e) and all of the self-loops (Fig. 1b). In other words, this example 
is a multi-armed bandit problem with a revealing action. The independence number of this 
graph is $K - 1$, while its weak domination number is 1. Since the loopy star is strongly 
observable, it induces a game with minimax regret $\tilde{\Theta}(\sqrt{TK})$. However, removing a single 
loop from the feedback graph turns it into a weakly observable graph, and its minimax regret 
rate changes to $\tilde{\Theta}(T^{2/3})$ (with no polynomial dependence on $K$). 

Algorithm 1: Exp3.G: online learning with a feedback graph 
Parameters: Feedback graph $G = (V, E)$, learning rate $\eta > 0$, 
exploration set $U \subseteq V$, exploration rate $\gamma \in [0, 1]$ 

Let $u$ be the uniform distribution over $U$; 
Initialize $q_1$ to the uniform distribution over $V$; 
For round $t = 1, 2, \dots$ 
    Compute $p_t = (1 - \gamma) q_t + \gamma u$; 
    Draw $I_t \sim p_t$, play $I_t$ and incur loss $\ell_t(I_t)$; 
    Observe $\{(i, \ell_t(i)) : i \in N^{\mathrm{out}}(I_t)\}$; 
    Update 
        $\forall i \in V: \quad \hat{\ell}_t(i) = \frac{\ell_t(i)}{P_t(i)} \, \mathbb{I}\{i \in N^{\mathrm{out}}(I_t)\}$, with $P_t(i) = \sum_{j \in N^{\mathrm{in}}(i)} p_t(j)$;    (1) 
        $\forall i \in V: \quad q_{t+1}(i) = \frac{q_t(i) \exp(-\eta \hat{\ell}_t(i))}{\sum_{j \in V} q_t(j) \exp(-\eta \hat{\ell}_t(j))}$.    (2) 


3 The Exp3.G Algorithm 

The upper bounds for weakly and strongly observable graphs in Theorem 1 are both achieved 
by an algorithm we introduce, called Exp3.G (see Algorithm 1), which is a variant of the 
Exp3-SET algorithm for undirected feedback graphs (Alon et al., 2013). 

Similarly to Exp3 and Exp3-SET, our algorithm uses importance sampling to construct 
unbiased loss estimates with controlled variance. Indeed, notice that $P_t(i) = \mathbb{P}(i \in N^{\mathrm{out}}(I_t))$ 
is simply the probability of observing the loss $\ell_t(i)$ upon playing $I_t \sim p_t$. Hence, $\hat{\ell}_t(i)$ is an 
unbiased estimate of the true loss $\ell_t(i)$, and for all $t$ and $i \in V$ we have 

$$\mathbb{E}_t[\hat{\ell}_t(i)] = \ell_t(i) \quad \text{and} \quad \mathbb{E}_t[\hat{\ell}_t(i)^2] = \frac{\ell_t(i)^2}{P_t(i)} . \tag{3}$$ 

The purpose of the exploration distribution $u$ is to control the variance of the loss estimates 
by providing a lower bound on $P_t(i)$ for those $i \in V$ in the support of $u$; this ingredient will 
turn out to be essential to our analysis. 
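The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative rendering rather than the authors' code: the dictionary-based graph representation, the function name `exp3g`, its argument order, and the use of a caller-supplied `random.Random` instance are all assumptions made for the example.

```python
import math
import random

def exp3g(graph, losses, eta, gamma, explore_set, rng):
    """Run Exp3.G on a fixed feedback graph.

    graph: dict {action: set of actions observed when it is played} (assumed layout).
    losses: sequence of per-round loss lists, losses[t][i] in [0, 1].
    Returns the player's cumulative loss and the final distribution q.
    """
    vertices = sorted(graph)
    n_in = {i: [j for j in vertices if i in graph[j]] for i in vertices}
    u = {i: (1.0 / len(explore_set) if i in explore_set else 0.0) for i in vertices}
    q = {i: 1.0 / len(vertices) for i in vertices}
    total_loss = 0.0
    for loss in losses:
        # Mix the exponential-weights distribution with uniform exploration over U.
        p = {i: (1.0 - gamma) * q[i] + gamma * u[i] for i in vertices}
        played = rng.choices(vertices, weights=[p[i] for i in vertices])[0]
        total_loss += loss[played]
        estimate = {}
        for i in vertices:
            # P_t(i): probability that the loss of i is observed this round, Eq. (1).
            prob_observed = sum(p[j] for j in n_in[i])
            estimate[i] = loss[i] / prob_observed if i in graph[played] else 0.0
        # Exponential-weights update, Eq. (2).
        weights = {i: q[i] * math.exp(-eta * estimate[i]) for i in vertices}
        norm = sum(weights.values())
        q = {i: weights[i] / norm for i in vertices}
    return total_loss, q
```

On a loopless clique with one action that always has zero loss, the distribution `q` returned by this sketch concentrates on that action even though the player never observes the loss of the action he plays.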

We now state the upper bounds on the regret achieved by Algorithm 1 . 

Theorem 2. Let $G = (V, E)$ be a feedback graph with $K = |V|$, independence number 
$\alpha = \alpha(G)$ and weak domination number $\delta = \delta(G)$. Let $D$ be a weakly dominating set such 
that $|D| = \delta$. The expected regret of Algorithm 1 on the online learning problem induced by 
$G$ satisfies the following: 

(i) if $G$ is strongly observable, then for $U = V$, $\gamma = \min\{(1/(\alpha T))^{1/2}, 1/2\}$ and $\eta = 2\gamma$, the 
expected regret against any loss sequence is $O(\alpha^{1/2} T^{1/2} \ln(KT))$; 

(ii) if $G$ is weakly observable and $T \ge K^3 \ln(K)/\delta^2$, then for $U = D$, $\gamma = \min\{(\delta \ln K / T)^{1/3}, 1/2\}$ 
and $\eta = \gamma^2/\delta$, the expected regret against any loss sequence is $O((\delta \ln K)^{1/3} T^{2/3})$. 

In the previously studied self-aware case (i.e., strongly observable with self-loops), our 
result matches the bounds of Alon et al. (2014); Kocák et al. (2014). The tightness of our 
bounds in all cases is discussed in Section 4 below. 


3.1 A Tight Bound for the Loopless Clique 

One of the simplest examples of a feedback graph that is not self-aware is the loopless clique 
(Fig. 1c). This graph is strongly observable with an independence number of 1, so Theorem 2 
guarantees that the regret of Algorithm 1 in the induced game is $O(\sqrt{T} \ln(KT))$. However, 
in this case we can do better than Theorem 2 and prove (see Appendix C) that the regret of 
the same algorithm is actually $O(\sqrt{T \ln K})$, which is the same as the regret rate of the full 
feedback game (Fig. 1a). In other words, if we start with full feedback and then hide the 
player’s own loss, the regret rate remains the same (up to constants). 

Theorem 3. For any sequence of loss functions $\ell_1, \dots, \ell_T$, where $\ell_t : V \to [0, 1]$, the regret of 
Algorithm 1, with the loopless clique feedback graph and with parameters $\eta = \sqrt{(\ln K)/(2T)}$ 
and $\gamma = 2\eta$, is upper-bounded by $5\sqrt{T \ln K}$. 


3.2 Refined Second-order Bound for Hedge 

Our analysis of Exp3.G builds on a new second-order regret bound for the classic Hedge 
algorithm.^ Recall that Hedge (Freund and Schapire, 1997) operates in the full feedback 
setting (see Fig. 1a), where at time $t$ the player has access to the losses $\ell_s(i)$ for all $s < t$ and 
$i \in V$. Hedge draws the action $I_t$ from the distribution $q_t$ defined by 

$$\forall i \in V, \qquad q_t(i) = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} \ell_s(i)\big)}{\sum_{j \in V} \exp\big(-\eta \sum_{s=1}^{t-1} \ell_s(j)\big)} , \tag{4}$$ 

where $\eta$ is a positive learning rate. The following novel regret bound is key to proving that 
our algorithm achieves tight bounds over the regret (to within logarithmic factors). 

Lemma 4. Let $q_1, \dots, q_T$ be the probability vectors defined by Eq. (4) for a sequence of loss 
functions $\ell_1, \dots, \ell_T$ such that $\ell_t(i) \ge 0$ for all $t = 1, \dots, T$ and $i \in V$. For each $t$, let $S_t$ be 
a subset of $V$ such that $\ell_t(i) \le 1/\eta$ for all $i \in S_t$. Then, for any $i^* \in V$ it holds that 

$$\sum_{t=1}^{T} \sum_{i \in V} q_t(i) \ell_t(i) - \sum_{t=1}^{T} \ell_t(i^*) \le \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \Big( \sum_{i \in S_t} q_t(i) (1 - q_t(i)) \ell_t(i)^2 + \sum_{i \notin S_t} q_t(i) \ell_t(i)^2 \Big) .$$ 

See Appendix A for a proof of this result. The standard second-order regret bound of 
Hedge (see, e.g., Cesa-Bianchi et al., 2007) is obtained by setting $S_t = \emptyset$ for all $t$. Therefore, 
our bound features a slightly improved dependence (i.e., the $1 - q_t(i)$ factors) on actions 
whose losses do not exceed $1/\eta$. Indeed, in the analysis of Exp3.G, we apply the above 
lemma to the loss estimates $\hat{\ell}_t(i)$, and include in the sets $S_t$ all strongly observable vertices $i$ 
that do not have a self-loop. This allows us to gain a finer control on the variances $\ell_t(i)^2 / P_t(i)$ 
of such vertices. 

^A second-order regret bound controls the regret with an expression that depends on a quantity akin to 
the second moment of the losses. 
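The Hedge distribution of Eq. (4) is a softmax of the negated cumulative losses. A minimal sketch follows; the function name and argument layout are assumptions, and the subtraction of the smallest cumulative loss is a standard numerical-stability device that leaves the distribution unchanged.

```python
import math

def hedge_distribution(past_losses, eta, num_actions):
    """Distribution of Eq. (4) given the loss vectors of all previous rounds.

    past_losses: list of per-round loss lists (assumed layout), possibly empty.
    """
    cumulative = [sum(loss[i] for loss in past_losses) for i in range(num_actions)]
    # Shifting by the minimum does not change the ratios but avoids underflow.
    shift = min(cumulative)
    weights = [math.exp(-eta * (c - shift)) for c in cumulative]
    norm = sum(weights)
    return [w / norm for w in weights]
```

With no past losses the distribution is uniform, and each unit of cumulative loss multiplies an action's unnormalized weight by $e^{-\eta}$.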


3.3 Proof of Theorem 2 


We now turn to prove Theorem 2. For the proof, we need the following graph-theoretic 
result, which is a variant of Alon et al. (2014, Lemma 16); for completeness, we include a 
proof in Appendix A. 

Lemma 5. Let $G = (V, E)$ be a directed graph with $|V| = K$, in which each node $i \in V$ is 
assigned a positive weight $w_i$. Assume that $\sum_{i \in V} w_i \le 1$, and that $w_i \ge \epsilon$ for all $i \in V$ for 
some constant $0 < \epsilon < \frac{1}{2}$. Then 

$$\sum_{i \in V} \frac{w_i}{w_i + \sum_{j \in N^{\mathrm{in}}(i)} w_j} \le 4 \alpha \ln \frac{4K}{\alpha \epsilon} ,$$ 

where $\alpha = \alpha(G)$ is the independence number of $G$. 

Proof of Theorem 2. Without loss of generality, we may assume that $K \ge 2$. The proof 
proceeds by applying Lemma 4 and upper bounding the second-order terms it introduces. 
Indeed, since the distributions $q_1, q_2, \dots$ generated by Algorithm 1 via Eq. (2) are of the form 
given by Eq. (4), with the losses $\ell_t$ replaced by the nonnegative loss estimates $\hat{\ell}_t$, we may 
apply Lemma 4 to these distributions and loss estimates. The way we apply the lemma differs 
between the strongly observable and weakly observable cases, and we treat each separately. 

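First, assume that $G$ is strongly observable, implying that the exploration distribution $u$ 
is uniform on $V$. Notice that for any $i \in V$ without a self-loop, namely with $i \notin N^{\mathrm{in}}(i)$, we 
have $j \in N^{\mathrm{in}}(i)$ for all $j \ne i$, and so $P_t(i) = 1 - p_t(i)$. On the other hand, by the definition 
of $p_t$ and since $\eta = 2\gamma$ and $K \ge 2$, we have $p_t(i) = (1 - \gamma) q_t(i) + \frac{\gamma}{K} \le \dots$, so 
that $P_t(i) \ge \eta$. Thus, we can apply Lemma 4 with $S_t = S = \{i : i \notin N^{\mathrm{in}}(i)\}$ to the vectors 
$\hat{\ell}_1, \dots, \hat{\ell}_T$ and take expectations, and obtain that 

$$\mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in V} q_t(i) \mathbb{E}_t[\hat{\ell}_t(i)] - \sum_{t=1}^{T} \mathbb{E}_t[\hat{\ell}_t(i^*)] \Big] \le \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \mathbb{E}\Big[ \sum_{i \in S} q_t(i) (1 - q_t(i)) \mathbb{E}_t[\hat{\ell}_t(i)^2] + \sum_{i \notin S} q_t(i) \mathbb{E}_t[\hat{\ell}_t(i)^2] \Big]$$ 

for any fixed $i^* \in V$. Recalling Eq. (3) and $P_t(i) = 1 - p_t(i)$ for all $i \in S$, we get 

$$\mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in V} q_t(i) \ell_t(i) \Big] - \sum_{t=1}^{T} \ell_t(i^*) \le \frac{\ln K}{\eta} + \eta \, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in S} q_t(i) \frac{1 - q_t(i)}{1 - p_t(i)} + \sum_{t=1}^{T} \sum_{i \notin S} \frac{q_t(i)}{P_t(i)} \Big] .$$ 

The sum over $i \in S$ on the right-hand side is bounded as follows: 

$$\sum_{t=1}^{T} \sum_{i \in S} q_t(i) \frac{1 - q_t(i)}{1 - p_t(i)} \le 2 \sum_{t=1}^{T} \sum_{i \in S} q_t(i) \le 2T .$$ 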


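For the second sum, recall that any $i \notin S$ has a self-loop in the feedback graph, and also 
that $p_t(i) \ge \frac{\gamma}{K}$ as a result of mixing in the uniform distribution over $V$. Hence, we can use 
$p_t(i) \ge (1 - \gamma) q_t(i) \ge \frac{1}{2} q_t(i)$ and apply Lemma 5 with $\epsilon = \frac{\gamma}{K}$ that yields 

$$\sum_{i \notin S} \frac{q_t(i)}{P_t(i)} \le 2 \sum_{i \notin S} \frac{p_t(i)}{P_t(i)} \le 8 \alpha \ln \frac{4K^2}{\alpha \gamma} .$$ 

Putting everything together, and using the fact that $p_t(i) \le q_t(i) + \gamma u(i)$ to obtain 

$$\sum_{i \in V} p_t(i) \ell_t(i) \le \sum_{i \in V} q_t(i) \ell_t(i) + \gamma , \tag{5}$$ 

results with the regret bound 

$$\mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in V} p_t(i) \ell_t(i) \Big] - \sum_{t=1}^{T} \ell_t(i^*) \le \gamma T + \frac{\ln K}{\eta} + 2 \eta T \Big( 1 + 4 \alpha \ln \frac{4K^2}{\alpha \gamma} \Big) .$$ 

Substituting the chosen values of $\eta$ and $\gamma$ gives the first claim of the theorem. 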

Next, assume that $G$ is only weakly observable. Let $D \subseteq V$ be a weakly dominating set 
supporting the exploration distribution $u$, with $|D| = \delta$. Similarly to the strongly observable 
case, we apply Lemma 4 to the vectors $\hat{\ell}_1, \dots, \hat{\ell}_T$, but in this case we set $S_t = \emptyset$ for all $t$. 
Using Eqs. (3) and (5) and proceeding exactly as in the strongly observable case, we obtain 

$$\mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in V} p_t(i) \ell_t(i) \Big] - \sum_{t=1}^{T} \ell_t(i^*) \le \gamma T + \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \mathbb{E}\Big[ \sum_{i \in V} \frac{q_t(i)}{P_t(i)} \Big]$$ 

for any fixed $i^* \in V$. In order to bound the expectation in the right-hand side, consider 
again the set $S = \{i : i \notin N^{\mathrm{in}}(i)\}$ of vertices without a self-loop, and observe that $P_t(i) = 
\sum_{j \in N^{\mathrm{in}}(i)} p_t(j) \ge \frac{\gamma}{\delta}$ for all $i \in S$. Indeed, if $i$ is weakly observable then there exists some 
$k \in D$ such that $k \in N^{\mathrm{in}}(i)$ and $p_t(k) \ge \frac{\gamma}{\delta}$ because the exploration distribution $u$ is uniform 
over $D$; if $i$ is strongly observable then the same holds since $i$ does not have a self-loop and 
thus must be dominated by all other vertices in the graph. Hence, 

$$\sum_{i \in V} \frac{q_t(i)}{P_t(i)} = \sum_{i \in S} \frac{q_t(i)}{P_t(i)} + \sum_{i \notin S} \frac{q_t(i)}{P_t(i)} \le \frac{\delta}{\gamma} + 2K ,$$ 

where we used $P_t(i) \ge p_t(i) \ge (1 - \gamma) q_t(i) \ge \frac{1}{2} q_t(i)$ to bound the sum over the vertices having 
a self-loop. Therefore, we may write 

$$\mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i \in V} p_t(i) \ell_t(i) \Big] - \sum_{t=1}^{T} \ell_t(i^*) \le \gamma T + \frac{\ln K}{\eta} + \frac{\eta \delta}{\gamma} T + 2 \eta K T .$$ 

Substituting our choices of $\eta$ and $\gamma$, we obtain the second claim of the theorem. □ 


4 Lower Bounds 

In this section we prove lower bounds on the minimax regret for non-observable and weakly 
observable graphs. Together with Theorem 2 and the known lower bound of $\Omega(\sqrt{\alpha(G) T})$ for 
strongly observable graphs (Alon et al., 2014, Theorem 5),^ these results complete the proof 
of Theorem 1. We remark that their lower bound applies when $T \ge \alpha(G)^3$, which includes 
our regime of interest. We begin with a simple lower bound for non-observable feedback 
graphs. 

Theorem 6. If $G = (V, E)$ is not observable and $|V| \ge 2$, then for any player algorithm 
there exists a sequence of loss functions $\ell_1, \ell_2, \dots : V \to [0, 1]$ such that the player’s expected 
regret is at least $\frac{1}{4} T$. 

The proof is straightforward: if G is not observable, then it is possible to find a vertex of 
G with no incoming edges; the environment can then set the loss of this vertex to be either 0 
or 1 on all rounds of the game, and the player has no way of knowing which is the case. For 
the formal proof, refer to Appendix B. Next, we prove a lower bound for weakly observable 
feedback graphs. 

Theorem 7. If $G = (V, E)$ is weakly observable with $K = |V| \ge 2$ and weak domination 
number $\delta = \delta(G)$, then for any randomized player algorithm and for any time horizon $T$ 
there exists a sequence of loss functions $\ell_1, \dots, \ell_T : V \to [0, 1]$ such that the player’s expected 
regret is at least $\frac{1}{150} (\delta / \ln^2 K)^{1/3} T^{2/3}$. 

The proof relies on the following graph-theoretic result, relating the notions of domination 
and independence in directed graphs. 

Lemma 8. Let $G = (V, E)$ be a directed graph over $|V| = n$ vertices, and let $W \subseteq V$ be a 
set of vertices whose minimal dominating set is of size $k$. Then, $W$ contains an independent 
set $U$ of size at least $\frac{1}{50} k / \ln n$, with the property that any vertex of $G$ dominates at most $\ln n$ 
vertices of $U$. 

^While Alon et al. (2014) only consider the special case of graphs that have self-loops at all vertices, 
their lower bound applies to any strongly observable graph: we can simply add any missing self-loops to the 
graph, without changing its independence number $\alpha$. The resulting learning problem, whose minimax regret 
is $\Omega(\sqrt{\alpha T})$, is only easier for the player who may ignore the additional feedback. 

Proof. If $k < 50 \ln n$ the statement is vacuous; hence, in what follows we assume $k \ge 50 \ln n$. 
Let $\beta = (2 \ln n)/k < 1$. Our first step is to prove that $W$ contains a non-empty set $R$ such that 
each vertex of $G$ dominates at most a $\beta$ fraction of $R$, namely such that $|N^{\mathrm{out}}(v) \cap R| \le \beta |R|$ 
for all $v \in V$. To prove this, consider the following iterative process: initialize $R = W$, and 
as long as there exists a vertex $v \in V$ such that $|N^{\mathrm{out}}(v) \cap R| > \beta |R|$, remove all the vertices 
$v$ dominates from $R$. Notice that the process cannot continue for $k$ (or more) iterations, 
since in each step the size of $R$ decreases at least by a factor of $1 - \beta$, so after $k - 1$ steps we 
have $|R| \le n (1 - \beta)^{k-1} < n e^{-\beta k / 2} = 1$. On the other hand, the process cannot end with 
$R = \emptyset$, as in that case the vertices $v$ found along the way form a dominating set of $W$ whose 
size is less than $k$, which is a contradiction to our assumption. Hence, the set $R$ at the end 
of the process must be non-empty and satisfy $|N^{\mathrm{out}}(v) \cap R| \le \beta |R|$ for all $v \in V$, as claimed. 

Next, consider a random set $S \subseteq R$ formed by picking a multiset $\tilde{S}$ of $m = [\dots]$ elements 
from $R$ independently and uniformly at random (with replacement), and discarding any 
repeating elements. Notice that $m \le [\dots]$, as $|R| \ge \frac{1}{\beta} |N^{\mathrm{out}}(v) \cap R|$ for any $v \in V$, and for 
some $v$ the right-hand side is non-zero. The proof proceeds via the probabilistic method: we 
will show that with positive probability, $S$ contains an independent set as required, which 
would give the lemma. 

We first observe the following properties of the set $S$. 

Claim. With probability at least $[\dots]$ it holds that $|S| \ge [\dots] m$. 

To see this, note that each element from $R$ is not included in $S$ with probability $(1 - \frac{1}{r})^m \le 
e^{-m/r}$, where $r = |R|$. Since $m \le [\dots] r$, the expected size of $S$ is at least $r (1 - e^{-m/r}) = 
r e^{-m/r} (e^{m/r} - 1) \ge [\dots]$, where both inequalities use $e^x \ge x + 1$. Since always 
$|S| \le m$, Markov’s inequality shows that $|S| \ge [\dots] m$ with probability at least $[\dots]$; otherwise, 
we would have $\mathbb{E}[|S|] \le [\dots]$. 

Claim. With probability at least $[\dots]$ we have $|N^{\mathrm{out}}(v) \cap S| \le \ln n$ for all $v \in V$. 

Indeed, fix some $v \in V$ and recall that $v$ dominates at most a $\beta$ fraction of the vertices 
in $R$, so each element of $\tilde{S}$ (that was chosen uniformly at random from $R$) is dominated by $v$ 
with probability at most $\beta$. Hence, the random variable $\tilde{X}_v = |N^{\mathrm{out}}(v) \cap \tilde{S}|$ has a binomial 
distribution $\mathrm{Bin}(m, p)$ with $p \le \beta$. By a standard binomial tail bound, 

$$\mathbb{P}(\tilde{X}_v \ge \ln n) \le \binom{m}{\ln n} \beta^{\ln n} \le (m \beta)^{\ln n} \le [\dots] .$$ 

The same bound holds also for the random variable $X_v = |N^{\mathrm{out}}(v) \cap S|$, that can only be 
smaller than $\tilde{X}_v$. Our claim now follows from a union bound over all $v \in V$. 

Claim. With probability at least $[\dots]$ we have $[\dots] \sum_{v \in S} |N^{\mathrm{out}}(v) \cap S| \le [\dots]$. 

To obtain this, we note that for each $v \in V$ the random variable $X_v = |N^{\mathrm{out}}(v) \cap S|$ 
defined above has $\mathbb{E}[X_v] \le \mathbb{E}[\tilde{X}_v] \le m \beta \le [\dots]$, and therefore $[\dots]$. By 
Markov’s inequality we then have $[\dots]$ with probability less than $[\dots]$, which gives 
the claim. 

The three claims together imply that there exists a set $S \subseteq W$ of size at least $[\dots]$ 
such that any $v \in V$ dominates at most $\ln n$ vertices of $S$, and the average degree of the 
induced undirected graph over $S$ is at most 1. Hence, by Turán’s Theorem,^ $S$ contains 
an independent set $U$ of size at least $\frac{1}{50} k / \ln n$. This concludes the proof, as each $v \in V$ 
dominates at most $\ln n$ vertices of $U$. □ 

Given Lemma 8, the idea of the proof is quite intuitive; here we only give a sketch of the 
proof, and defer the formal details to Appendix B. 

Proof of Theorem 7 (sketch). First, we use the lemma to find an independent set $U$ of weakly
observable vertices of size $\widetilde{\Omega}(\delta)$, with the crucial property that each vertex in the entire graph
dominates at most $\widetilde{O}(1)$ vertices of $U$. Then, we embed in the set $U$ a hard instance of the
stochastic multiarmed bandit problem, in which the optimal action has expected loss smaller
by $\epsilon$ than the expected loss of the other actions in $U$. To all other vertices of the graph, we
assign the maximal loss of 1. Hence, unless the player is able to detect the optimal action,
his regret cannot be better than $\Omega(\epsilon T)$.

The main observation is that, due to the properties of the set $U$, in order to obtain
accurate estimates of the losses of all actions in $U$ the player has to use $\widetilde{\Omega}(\delta)$ different
actions outside of $U$ and pick each $\widetilde{\Omega}(1/\epsilon^2)$ times. Since each such action entails a constant
instantaneous regret, the player has to pay an $\widetilde{\Omega}(\delta/\epsilon^2)$ penalty in his cumulative regret for
exploration. The overall regret is thus of order $\widetilde{\Omega}(\min\{\epsilon T, \delta/\epsilon^2\})$, which is maximized at
$\epsilon = \widetilde{\Theta}(\delta^{1/3} T^{-1/3})$ and gives the stated lower bound. □
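
As a rough numerical illustration of the balancing step in the sketch above, the following Python snippet evaluates $\min\{\epsilon T, \delta/\epsilon^2\}$ and locates the gap at which the two terms meet. Constants and logarithmic factors are ignored, and the function names are hypothetical helpers introduced only for this illustration.

```python
def forced_regret(eps: float, T: int, delta: int) -> float:
    """Order of the regret forced for gap eps: min{eps*T, delta/eps^2} (log factors dropped)."""
    return min(eps * T, delta / eps ** 2)


def balancing_gap(T: int, delta: int) -> float:
    """Gap at which eps*T equals delta/eps^2, namely eps = (delta/T)^(1/3)."""
    return (delta / T) ** (1.0 / 3.0)


T, delta = 10_000, 8
eps_star = balancing_gap(T, delta)
best = forced_regret(eps_star, T, delta)
# Scan a grid of gaps in (0, 1]; no grid point should beat the balancing point,
# because eps*T is increasing in eps while delta/eps^2 is decreasing.
grid_max = max(forced_regret(k / 1000, T, delta) for k in range(1, 1001))
print(eps_star, best, grid_max)
```

At the balancing point both terms equal $\delta^{1/3} T^{2/3}$, which is the rate appearing in the lower bound.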


5 Time-Varying Feedback Graphs 

The setting discussed above can be generalized by allowing the feedback graphs to change
arbitrarily from round to round (see Mannor and Shamir (2011); Alon et al. (2013); Kocák et al.
(2014)). Namely, the environment chooses a sequence of feedback graphs $G_1, \ldots, G_T$ along
with the sequence of loss functions. We consider two different variants of this setting: in
the informed model, the player observes $G_t$ at the beginning of round $t$, before drawing the
action $I_t$. In the harder uninformed model, the player observes $G_t$ at the end of round $t$,
after drawing $I_t$. In this section, we discuss how our algorithm can be modified to handle
time-varying feedback graphs, and whether this generalization increases the minimax regret
of the induced online learning problem.

Strongly Observable. If $G_1, \ldots, G_T$ are all strongly observable, Algorithm 1 and its analysis
can be adapted to the time-varying setting (both informed and uninformed) with only
a few cosmetic modifications. Specifically, we replace $G$ with $G_t$, to define time-dependent
neighborhoods, $N_t^{\mathrm{out}}$ and $N_t^{\mathrm{in}}$, in Eq. (1) of the algorithm. This modification holds in both
the informed and uninformed models because the structure of the feedback graph is only used
to update $q_{t+1}$, which takes place after the action $I_t$ is chosen. Moreover, the upper bound in
Theorem 2 can be adapted to the time-varying model by replacing $\alpha$ with $\frac{1}{T}\sum_{t=1}^{T} \alpha_t$, where
each $\alpha_t$ is the independence number of the corresponding $G_t$ (e.g., using a doubling trick, or
an adaptive learning rate as in Kocák et al. (2014)).

^Turán's Theorem (e.g., Alon and Spencer, 2008) states that in any undirected graph over $n$ vertices
whose average degree is $d$, there is an independent set of size at least $n/(d+1)$.

Weakly Observable, Informed. If $G_1, \ldots, G_T$ are all weakly observable, Algorithm 1
can again be adapted to the informed time-varying model, but the required modification is
more substantial than before, and in particular, relies on the fact that $G_t$ is known before
the prediction on round $t$ is made. The exploration set $U$ must change from round to round,
according to the feedback graph. Specifically, we choose the exploration set on round $t$ to
be $D_t$, the smallest weakly dominating set in $G_t$. We then define $u_t$ to be the uniform
distribution over this set, and $p_t = (1 - \gamma) q_t + \gamma u_t$. Again, via standard techniques, the
upper bound in Theorem 2 can be adapted to this setting by replacing $\delta$ with $\frac{1}{T}\sum_{t=1}^{T} \delta_t$,
where $\delta_t = |D_t|$.
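
The per-round exploration step described above can be sketched as follows. This is only an illustration under stated assumptions: the graph is represented as a dictionary of out-neighborhoods (a self-loop is encoded by a vertex appearing in its own set), a vertex is taken to be weakly observable when it has an incoming edge but neither a self-loop nor incoming edges from all other vertices, and the smallest weakly dominating set is found by brute force, which is only practical for small graphs. All function names are hypothetical.

```python
from itertools import combinations


def in_neighbors(out_nbrs: dict, i) -> set:
    """Vertices j with an edge j -> i (playing j reveals the loss of i)."""
    return {j for j, outs in out_nbrs.items() if i in outs}


def weakly_observable_vertices(out_nbrs: dict) -> set:
    """Observable vertices that are not strongly observable (assumed definition)."""
    V = set(out_nbrs)
    weak = set()
    for i in V:
        n_in = in_neighbors(out_nbrs, i)
        strongly = i in n_in or (V - {i}) <= n_in
        if n_in and not strongly:
            weak.add(i)
    return weak


def smallest_weakly_dominating_set(out_nbrs: dict) -> set:
    """Brute-force search for a smallest set whose out-neighborhoods cover all weak vertices."""
    weak = weakly_observable_vertices(out_nbrs)
    V = sorted(out_nbrs)
    for size in range(len(V) + 1):
        for D in combinations(V, size):
            covered = set().union(*(out_nbrs[j] for j in D))
            if weak <= covered:
                return set(D)
    return set(V)


def exploration_mixture(q: dict, D: set, gamma: float) -> dict:
    """p_t = (1 - gamma) * q_t + gamma * u_t, with u_t uniform over D."""
    u = {i: (1.0 / len(D) if i in D else 0.0) for i in q}
    return {i: (1 - gamma) * q[i] + gamma * u[i] for i in q}


# Vertex 3 has no self-loop and is revealed only by vertex 0.
G_t = {0: {0, 3}, 1: {1}, 2: {2}, 3: {0}}
D_t = smallest_weakly_dominating_set(G_t)
p_t = exploration_mixture({i: 0.25 for i in G_t}, D_t, gamma=0.2)
print(D_t, p_t)
```

In the informed model this computation can be redone at the start of every round, since $G_t$ is known before the action is drawn.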

Weakly Observable, Uninformed. So far, we discussed cases where the minimax regret
rates of our problem do not increase when we allow the feedback graphs to change from
round to round. However, if $G_1, \ldots, G_T$ are all weakly observable and they are revealed
according to the uninformed model, then the minimax regret can strictly increase. Recall
that Theorem 1 states that the minimax regret for a constant weakly observable graph is
$\widetilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\delta$ is the size of the smallest weakly dominating set. We now show that
the minimax regret in the analogous uninformed setting is $\widetilde{\Theta}(K^{1/3} T^{2/3})$, where $K$ is the
number of actions. The $\widetilde{O}(K^{1/3} T^{2/3})$ upper bound is obtained by running Algorithm 1 with
uniform exploration over the entire set of actions (namely, $U = V$). To show that this bound
is tight, we state the following matching lower bound.

Theorem 9. For any randomized player strategy in the uninformed feedback model, there
exists a sequence of weakly observable graphs $G_1, \ldots, G_T$ over a set $V$ of $K \ge 4$ actions with
$\delta(G_t) = \alpha(G_t) = 1$ for all $t$, and a sequence of loss functions $\ell_1, \ldots, \ell_T : V \to [0,1]$, such

that the player’s expected regret is at least 

We sketch the proof below, and present it in full detail in Appendix B. 

Proof (sketch). For each $t = 1, \ldots, T$, construct the graph $G_t$ as follows: start with the
complete graph over $K$ vertices (that includes all self-loops), and then remove the self-loop
and all edges incoming to $i = 1$ except for a single edge incoming from some vertex $j_t \ne 1$
chosen arbitrarily. Notice that the resulting graph is weakly observable (each vertex is
observable, but $i = 1$ is only weakly observable), has $\delta(G_t) = 1$ since $j_t$ dominates the entire
graph, and $\alpha(G_t) = 1$ as each two vertices are connected by at least one edge. However,
for observing the loss of $i = 1$ the player has to “guess” the revealing action $j_t$, that might
change arbitrarily from round to round. This random guessing of one out of $\Omega(K)$ actions
introduces the $K^{1/3}$ factor in the resulting bound. □
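
The construction in the sketch can be checked mechanically on a small instance. The snippet below is a minimal illustration, assuming vertices are labelled $1, \ldots, K$ and graphs are stored as dictionaries of out-neighborhoods; the helper names are hypothetical, and the independence number is computed by brute force.

```python
from itertools import combinations


def theorem9_graph(K: int, j_t: int) -> dict:
    """Out-neighborhoods of G_t: complete graph with self-loops on 1..K,
    minus every edge into vertex 1 except the one from j_t (with j_t != 1)."""
    if not (K >= 3 and 2 <= j_t <= K):
        raise ValueError("need K >= 3 and 2 <= j_t <= K")
    out = {v: set(range(1, K + 1)) for v in range(1, K + 1)}
    for v in out:
        if v != j_t:
            out[v].discard(1)  # also removes the self-loop of vertex 1
    return out


def in_neighbors(out: dict, i: int) -> set:
    return {j for j in out if i in out[j]}


def strongly_observable(out: dict, i: int) -> bool:
    """Self-loop, or incoming edges from all other vertices (assumed definition)."""
    n_in = in_neighbors(out, i)
    return i in n_in or (set(out) - {i}) <= n_in


def independence_number(out: dict) -> int:
    """Largest set with no edge, in either direction, between distinct members."""
    V = sorted(out)
    for size in range(len(V), 0, -1):
        for S in combinations(V, size):
            if all(b not in out[a] and a not in out[b] for a, b in combinations(S, 2)):
                return size
    return 0


G = theorem9_graph(K=5, j_t=3)
print(in_neighbors(G, 1), independence_number(G))
```

Vertex 1 is revealed only by $j_t$, every other vertex keeps its self-loop, and any two vertices remain adjacent, which matches the properties claimed in the sketch.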
A Additional Proofs


A.1 Proof of Lemma 4

In order to prove our new regret bound for Hedge, we first state and prove the standard
second-order regret bound for this algorithm.

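Lemma 10. For any $\eta > 0$ and for any sequence $\ell_1, \ldots, \ell_T$ of loss functions such that $\ell_t(i) \ge -1/\eta$
for all $t$ and $i$, the probability vectors $q_1, \ldots, q_T$ of Eq. (4) satisfy

$$\sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i) - \sum_{t=1}^{T} \ell_t(k) \;\le\; \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i)^2 \qquad \text{for any } k \in V .$$
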

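Proof. The proof follows the standard analysis of exponential weighting schemes: let
$w_t(i) = \exp\big(-\eta \sum_{s=1}^{t-1} \ell_s(i)\big)$ and let $W_t = \sum_{i \in V} w_t(i)$. Then $q_t(i) = w_t(i)/W_t$ and we can write

$$\begin{aligned}
\frac{W_{t+1}}{W_t} &= \sum_{i \in V} \frac{w_{t+1}(i)}{W_t} \\
&= \sum_{i \in V} \frac{w_t(i) \exp(-\eta\,\ell_t(i))}{W_t} \\
&= \sum_{i \in V} q_t(i) \exp(-\eta\,\ell_t(i)) \\
&\le \sum_{i \in V} q_t(i) \big(1 - \eta\,\ell_t(i) + \eta^2 \ell_t(i)^2\big) \qquad \text{(using } e^x \le 1 + x + x^2 \text{ for all } x \le 1\text{)} \\
&= 1 - \eta \sum_{i \in V} q_t(i)\,\ell_t(i) + \eta^2 \sum_{i \in V} q_t(i)\,\ell_t(i)^2 .
\end{aligned}$$

Taking logs, using $\ln(1 - x) \le -x$ for all $x \ge 0$, and summing over $t = 1, \ldots, T$ yields

$$\ln \frac{W_{T+1}}{W_1} \;\le\; -\eta \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i) + \eta^2 \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i)^2 .$$

Moreover, for any fixed action $k$, we also have

$$\ln \frac{W_{T+1}}{W_1} \;\ge\; \ln \frac{w_{T+1}(k)}{W_1} \;=\; -\eta \sum_{t=1}^{T} \ell_t(k) - \ln K .$$

Putting together and rearranging gives the result. □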
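
The second-order bound can be sanity-checked numerically. The sketch below is an illustration only: it assumes that Eq. (4) defines the usual Hedge distribution $q_t(i) \propto \exp(-\eta \sum_{s<t} \ell_s(i))$, draws arbitrary losses satisfying $\ell_t(i) \ge -1/\eta$, and reports the slack between the two sides of the inequality for every comparator $k$. The function names are hypothetical.

```python
import math
import random


def hedge_distributions(losses, eta):
    """q_t(i) proportional to exp(-eta * cumulative loss of i before round t)."""
    K = len(losses[0])
    cum = [0.0] * K
    qs = []
    for loss in losses:
        shift = min(cum)  # subtracting a constant leaves q_t unchanged and avoids overflow
        w = [math.exp(-eta * (c - shift)) for c in cum]
        W = sum(w)
        qs.append([x / W for x in w])
        cum = [c + l for c, l in zip(cum, loss)]
    return qs


def second_order_slack(losses, eta, k):
    """Right-hand side minus left-hand side of the second-order bound for comparator k."""
    qs = hedge_distributions(losses, eta)
    K = len(losses[0])
    player = sum(q[i] * l[i] for q, l in zip(qs, losses) for i in range(K))
    comparator = sum(l[k] for l in losses)
    second = sum(q[i] * l[i] ** 2 for q, l in zip(qs, losses) for i in range(K))
    return math.log(K) / eta + eta * second - (player - comparator)


rng = random.Random(0)
eta, K, T = 0.1, 5, 200
# Losses may be negative, but never below -1/eta, as the lemma requires.
losses = [[rng.uniform(-1 / eta, 3.0) for _ in range(K)] for _ in range(T)]
slacks = [second_order_slack(losses, eta, k) for k in range(K)]
print(min(slacks))
```

A non-negative slack for every $k$ is what the lemma predicts; the check says nothing about tightness.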

We can now prove Lemma 4, restated here for the convenience of the reader. 

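Lemma 4 (restated). Let $q_1, \ldots, q_T$ be the probability vectors defined by Eq. (4) for a sequence
of loss functions $\ell_1, \ldots, \ell_T$ such that $\ell_t(i) \ge 0$ for all $t = 1, \ldots, T$ and $i \in V$. For
each $t$, let $S_t$ be a subset of $V$ such that $\ell_t(i) \le 1/\eta$ for all $i \in S_t$. Then, it holds that

$$\sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i) - \sum_{t=1}^{T} \ell_t(k) \;\le\; \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \Bigg( \sum_{i \in S_t} q_t(i)\big(1 - q_t(i)\big)\ell_t(i)^2 + \sum_{i \notin S_t} q_t(i)\,\ell_t(i)^2 \Bigg) .$$
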
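Proof. For all $t$, let $\bar\ell_t = \sum_{i \in S_t} q_t(i)\,\ell_t(i)$, for which $\bar\ell_t \le 1/\eta$ by construction. Notice
that executing Hedge on the loss vectors $\ell_1, \ldots, \ell_T$ is equivalent to executing it on vectors
$\ell'_1, \ldots, \ell'_T$ with $\ell'_t(i) = \ell_t(i) - \bar\ell_t$ for all $i$. Applying Lemma 10 for the latter case (notice
that $\ell'_t(i) \ge -1/\eta$ for all $t$ and $i$), we obtain

$$\begin{aligned}
\sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell_t(i) - \sum_{t=1}^{T} \ell_t(k)
&= \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell'_t(i) - \sum_{t=1}^{T} \ell'_t(k) \\
&\le \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\,\ell'_t(i)^2 \\
&= \frac{\ln K}{\eta} + \eta \sum_{t=1}^{T} \sum_{i \in V} q_t(i)\big(\ell_t(i) - \bar\ell_t\big)^2 .
\end{aligned}$$

On the other hand, for all $t$,

$$\begin{aligned}
\sum_{i \in S_t} q_t(i)\big(\ell_t(i) - \bar\ell_t\big)^2
&= \sum_{i \in S_t} q_t(i)\,\ell_t(i)^2 - 2\bar\ell_t^{\,2} + \bar\ell_t^{\,2} \sum_{i \in S_t} q_t(i) \\
&\le \sum_{i \in S_t} q_t(i)\,\ell_t(i)^2 - \Bigg( \sum_{i \in S_t} q_t(i)\,\ell_t(i) \Bigg)^2 \\
&\le \sum_{i \in S_t} q_t(i)\,\ell_t(i)^2 - \sum_{i \in S_t} q_t(i)^2\,\ell_t(i)^2 \\
&= \sum_{i \in S_t} q_t(i)\big(1 - q_t(i)\big)\ell_t(i)^2 ,
\end{aligned}$$

where the first inequality uses $\sum_{i \in S_t} q_t(i) \le 1$ and the second follows from the non-negativity
of the losses $\ell_t(i)$. Also, since $\ell_t(i) > 1/\eta \ge \bar\ell_t$ for all $i \notin S_t$, we also have

$$\sum_{i \notin S_t} q_t(i)\big(\ell_t(i) - \bar\ell_t\big)^2 \;\le\; \sum_{i \notin S_t} q_t(i)\,\ell_t(i)^2 .$$

Combining the inequalities gives the lemma. □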
A.2 Proof of Lemma 5


Lemma 5 (restated). Let $G = (V, E)$ be a directed graph with $|V| = K$, in which each node
$i \in V$ is assigned a positive weight $w_i$. Assume that $\sum_{i \in V} w_i \le 1$, and that $w_i \ge \epsilon$ for all
$i \in V$ for some constant $0 < \epsilon < \frac{1}{2}$. Then

$$\sum_{i \in V} \frac{w_i}{w_i + \sum_{j \in N^{\mathrm{in}}(i)} w_j} \;\le\; 4\alpha \ln \frac{4K}{\alpha\epsilon} ,$$

where $\alpha = \alpha(G)$ is the independence number of $G$.


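Proof. Following the proof idea of Alon et al. (2013), let $M = \lceil 2K/\epsilon \rceil$ and introduce a
discretization of the values $w_1, \ldots, w_K$ such that $(m_i - 1)/M < w_i \le m_i/M$ for positive
integers $m_1, \ldots, m_K$. Since each $w_i \ge \epsilon$, we have $m_i \ge M w_i \ge \frac{2K}{\epsilon} \cdot \epsilon = 2K$. Hence, we
obtain

$$\sum_{i \in V} \frac{w_i}{w_i + \sum_{j \in N^{\mathrm{in}}(i)} w_j}
\;\le\; \sum_{i \in V} \frac{m_i}{m_i + \sum_{j \in N^{\mathrm{in}}(i)} m_j - K}
\;\le\; 2 \sum_{i \in V} \frac{m_i}{m_i + \sum_{j \in N^{\mathrm{in}}(i)} m_j} , \tag{6}$$

where the final inequality is true since $K \le \frac{1}{2} m_i \le \frac{1}{2}\big(m_i + \sum_{j \in N^{\mathrm{in}}(i)} m_j\big)$.

Now, consider a graph $G' = (V', E')$ created from $G$ by replacing each node $i \in V$ with a
clique $C_i$ over $m_i$ vertices, and connecting each vertex of $C_i$ to each vertex of $C_j$ if and only if
the edge $(i, j)$ is present in $G$. Then, the right-hand side of Eq. (6) equals $2 \sum_{i \in V'} \frac{1}{1 + d_i}$, where
$d_i$ is the in-degree of the vertex $i \in V'$ in the graph $G'$. Applying Lemma 13 of Alon et al.
(2013) to the graph $G'$, we can show that

$$\sum_{i \in V} \frac{m_i}{m_i + \sum_{j \in N^{\mathrm{in}}(i)} m_j}
\;\le\; 2\alpha \ln\Bigg(1 + \frac{\sum_{i \in V} m_i}{\alpha}\Bigg)
\;\le\; 2\alpha \ln\Bigg(1 + \frac{M + K}{\alpha}\Bigg)
\;\le\; 2\alpha \ln \frac{4K}{\alpha\epsilon} ,$$

and the lemma follows. □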
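
For intuition, the inequality of Lemma 5 can be evaluated on a small random instance. The snippet below is a sketch under stated assumptions: in-neighborhoods are stored without self-loops, weights are normalized to sum to one with each weight at least $\epsilon$, and the independence number is computed by brute force. The names are hypothetical, and on such a small graph the bound is far from tight.

```python
import math
import random
from itertools import combinations


def independence_number(in_nbrs: dict) -> int:
    """Largest set with no edge, in either direction, between distinct members."""
    V = sorted(in_nbrs)
    for size in range(len(V), 0, -1):
        for S in combinations(V, size):
            if all(a not in in_nbrs[b] and b not in in_nbrs[a] for a, b in combinations(S, 2)):
                return size
    return 0


def lemma5_sides(in_nbrs: dict, w: dict, eps: float):
    """Return (left-hand side, right-hand side) of the inequality of Lemma 5."""
    K = len(w)
    alpha = independence_number(in_nbrs)
    lhs = sum(w[i] / (w[i] + sum(w[j] for j in in_nbrs[i] if j != i)) for i in in_nbrs)
    rhs = 4 * alpha * math.log(4 * K / (alpha * eps))
    return lhs, rhs


rng = random.Random(1)
K, eps = 7, 0.05
in_nbrs = {i: {j for j in range(K) if j != i and rng.random() < 0.3} for i in range(K)}
raw = [rng.random() for _ in range(K)]
# Each weight is at least eps and the weights sum to one.
w = {i: eps + (1 - K * eps) * raw[i] / sum(raw) for i in range(K)}
lhs, rhs = lemma5_sides(in_nbrs, w, eps)
print(lhs, rhs)
```

The left-hand side never exceeds $K$, so the lemma only becomes informative when $\alpha$ is much smaller than $K$.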


B Proofs of Lower Bounds 

B.1 Non-observable Feedback Graphs

We first prove Theorem 6.

Theorem 6 (restated). If $G = (V, E)$ is not observable and $|V| \ge 2$, then for any player
algorithm there exists a sequence of loss functions $\ell_1, \ell_2, \ldots : V \to [0,1]$ such that the player's
expected regret is at least $\frac{1}{4}T$.

Proof. Since $G$ is not observable, there exists a node with no incoming edges, say node $i = 1$.
Consider the following randomized construction of loss functions $L_1, L_2, \ldots : V \to [0,1]$:
draw $\chi \in \{0, 1\}$ uniformly at random and set

$$L_t(i) = \begin{cases} \chi & \text{if } i = 1, \\ \frac{1}{2} & \text{if } i \ne 1. \end{cases}$$

Now fix some strategy of the player (which, without loss of generality, we may assume to be
deterministic) and denote by $M$ the random number of times it chooses action $i = 1$. Notice
that the player's actions, and consequently $M$, are independent of the random variable $\chi$
since the player never observes the loss value assigned to action $i = 1$. Letting $R_T$ denote the
player's regret after $T$ rounds, it holds that $E[R_T]$ (where expectation is taken with respect
to the randomization of the loss functions) satisfies

$$\begin{aligned}
E[R_T] &= \tfrac{1}{2} E\big[\tfrac{1}{2} M \mid \chi = 1\big] + \tfrac{1}{2} E\big[\tfrac{1}{2}(T - M) \mid \chi = 0\big] \\
&= \tfrac{1}{2} E\big[\tfrac{1}{2} M + \tfrac{1}{2}(T - M)\big] \\
&= \tfrac{1}{4} T .
\end{aligned}$$

This implies that there exists a realization $\ell_1, \ldots, \ell_T$ of the random functions for which the
regret is at least $\frac{1}{4}T$, as claimed. □
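
The regret computation in this proof can be reproduced exactly for any fixed action sequence. The sketch below assumes, as the regret terms in the proof indicate, that the unobserved action 1 has loss $\chi$ and every other action has loss $1/2$; the function name is hypothetical. Since the player never sees $\chi$, a deterministic strategy amounts to a fixed sequence of actions.

```python
from fractions import Fraction


def expected_regret(actions, T: int) -> Fraction:
    """Exact expected regret of a fixed action sequence when chi is uniform on {0, 1},
    action 1 has loss chi and all other actions have loss 1/2 (assumed construction)."""
    assert len(actions) == T
    M = sum(1 for a in actions if a == 1)   # plays of the unobserved action
    half = Fraction(1, 2)
    regret_chi1 = half * M                  # chi = 1: each play of action 1 costs 1 - 1/2
    regret_chi0 = half * (T - M)            # chi = 0: each other play costs 1/2 - 0
    return half * regret_chi1 + half * regret_chi0


T = 12
sequences = [[1] * T, [2] * T, [1, 2] * (T // 2), [1] + [3] * (T - 1)]
print([expected_regret(seq, T) for seq in sequences])
```

Whatever the number of plays $M$ of action 1, the two cases average to $T/4$.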

B.2 Weakly observable Feedback Graphs 

We now turn to prove our main lower bound for weakly observable graphs, stated in Theorem 7.

Theorem 7 (restated). If $G = (V, E)$ is weakly observable with $K = |V| \ge 2$ and weak
domination number $\delta = \delta(G)$, then for any randomized player algorithm and for any time
horizon $T$ there exists a sequence of loss functions $\ell_1, \ldots, \ell_T : V \to [0,1]$ such that the
player's expected regret is at least $\frac{1}{150}\big(\delta/\ln^2 K\big)^{1/3} T^{2/3}$.

Before proving the theorem, we recall the key combinatorial lemma it relies upon.

Lemma 8 (restated). Let $G = (V, E)$ be a directed graph over $|V| = n$ vertices, and let
$W \subseteq V$ be a set of vertices whose minimal dominating set is of size $k$. Then, $W$ contains
an independent set $U$ of size at least $\frac{1}{50} k/\ln n$, with the property that any vertex of $G$
dominates at most $\ln n$ vertices of $U$.

Proof of Theorem 7. As the minimal dominating set of the weakly observable part of $G$ is
of size $\delta$, Lemma 8 says that $G$ must contain an independent set $U$ of $m \ge \delta/(50 \ln K)$
weakly observable vertices, such that any $v \in V$ dominates at most $\ln K$ vertices of $U$. For
simplicity, we shall assume that $\delta \ge 100 \ln K$, which ensures that the set $U$ consists of at
least $m \ge 2$ vertices; a proof of the theorem for the (less interesting) case where $\delta < 100 \ln K$
is given after the current proof.

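Consider the following randomized construction of loss functions $L_1, \ldots, L_T : V \to [0,1]$:
fix $\epsilon = m^{1/3}(32\,T \ln K)^{-1/3}$, choose $\chi \in U$ uniformly at random, and for all $t$ and $i$ let
the loss $L_t(i) \sim \mathrm{Ber}(\mu_i)$ be a Bernoulli random variable with parameter

$$\mu_i = \begin{cases} \frac{1}{2} - \epsilon & \text{if } i = \chi, \\ \frac{1}{2} & \text{if } i \in U,\ i \ne \chi, \\ 1 & \text{if } i \notin U. \end{cases}$$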
We refer to actions in $U$ as “good” actions (whose expected instantaneous regret is at most
$\epsilon$), and to actions in $V \setminus U$ as “bad” actions (with expected instantaneous regret larger than
$\frac{1}{2}$). Notice that $N^{\mathrm{in}}(i) \subseteq V \setminus U$ for all good actions $i \in U$, since $U$ is an independent set of
weakly observable vertices (that do not have self-loops). In other words, in order to observe
the loss of a good action in a given round, the player has to pick a bad action on that round.

Fix some strategy of the player which we assume to be deterministic (again, this is
without loss of generality). Up to a constant factor in the resulting regret lower bound, we
may also assume that the strategy chooses bad actions at most $\epsilon T$ times with probability one
(i.e., over any realization of the stochastic loss functions). Indeed, we can ensure this is the
case by simply halting the player's algorithm once it chooses bad actions more than $\epsilon T$
times, and picking an arbitrary good action in the remaining rounds; since the instantaneous
regret of a good action is at most $\epsilon$, the regret of the modified algorithm is at most 3 times
larger than the regret of the original algorithm (the latter regret is at least $\frac{1}{2}\epsilon T$, while the
modification results in an increase of at most $\epsilon T$ in the regret).

Denote by $I_1, \ldots, I_T$ the sequence of actions played by the player's strategy throughout
the game, in response to the loss functions $L_1, \ldots, L_T$. For all $t$, let $Y_t$ be the vector of loss
values observed by the player on round $t$; we think about $Y_t$ as being a full $K$-vector, with
the unobserved values replaced by $-1$. For all $i \in U$, let $M_i$ be the number of times the
player picks the good action $i$, and $N_i$ be the number of times the player picks a bad action
from $N^{\mathrm{in}}(i)$. Also, let $M$ be the total number of times the player picks a good action, and
$N$ be the number of times he picks a bad action. Notice that $\sum_{i \in U} N_i \le N \ln K$, as each
vertex in $V \setminus U$ dominates at most $\ln K$ vertices of $U$ by construction. This, together with
our assumption that $N \le \epsilon T$ with probability one (i.e., that the player picks bad actions
at most $\epsilon T$ times), implies that

$$\sum_{i \in U} N_i \;\le\; \epsilon T \ln K . \tag{7}$$

In order to analyze the amount of information on the value of $\chi$ the player obtains by
observing the $Y_t$'s, we let $\mathcal{F}$ be the $\sigma$-algebra generated by the observed variables $Y_1, \ldots, Y_T$,
and define the conditional probability functions $Q^i(\cdot) = P(\,\cdot \mid \chi = i)$ over $\mathcal{F}$, for all $i \in U$.
Notice that under $Q^i$, action $i$ is the optimal action. For technical purposes, we also let $Q^0(\cdot)$
denote the fictitious probability function induced by picking $\chi = 0$; under this distribution,
all good actions in $U$ have an expected loss equal to $\frac{1}{2}$. For two probability functions $Q, Q'$
over $\mathcal{F}$, we denote by

$$D_{\mathrm{TV}}(Q, Q') = \sup_{A \in \mathcal{F}} |Q(A) - Q'(A)|$$

the total variation distance between $Q$ and $Q'$ with respect to $\mathcal{F}$. Then, we can bound the
total variation distance between $Q^0$ and each of the $Q^i$'s in terms of the random variables
$N_i$, as follows.

Lemma. For each $i \in U$, we have $D_{\mathrm{TV}}(Q^0, Q^i) \le \epsilon \sqrt{2\,E_{Q^0}[N_i]}$.
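Proof. As an intermediate step, we first upper bound the KL-divergence between $Q^0$ and
$Q^i$ in terms of the random variable $N_i$. Let $Q^j_t = Q^j(\,\cdot \mid Y_1, \ldots, Y_{t-1})$ for all $j$. Notice that $Q^0_t$
and $Q^i_t$ are identical unless the player picked an action from $N^{\mathrm{in}}(i)$ on round $t$. In this latter
case, $D_{\mathrm{KL}}(Q^0_t, Q^i_t)$ equals the KL-divergence between two Bernoulli random variables with
biases $\frac{1}{2}$ and $\frac{1}{2} - \epsilon$, which is upper bounded by $4\epsilon^2$ for $\epsilon \le \frac{1}{4}$. Thus, using the chain rule for
relative entropy we may write

$$\begin{aligned}
D_{\mathrm{KL}}(Q^0, Q^i) &= \sum_{t=1}^{T} D_{\mathrm{KL}}(Q^0_t, Q^i_t) \\
&= \sum_{t=1}^{T} Q^0\big(I_t \in N^{\mathrm{in}}(i)\big) \cdot D_{\mathrm{KL}}\big(\mathrm{Ber}(\tfrac{1}{2}), \mathrm{Ber}(\tfrac{1}{2} - \epsilon)\big) \\
&\le 4\epsilon^2 \sum_{t=1}^{T} Q^0\big(I_t \in N^{\mathrm{in}}(i)\big) = 4\epsilon^2\, E_{Q^0}[N_i] .
\end{aligned}$$

By Pinsker's inequality we have $D_{\mathrm{TV}}(Q^0, Q^i) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(Q^0, Q^i)}$, which gives the lemma. □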
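
The two analytic facts used in this proof, the bound $4\epsilon^2$ on the Bernoulli KL-divergence and Pinsker's inequality, are easy to check numerically. The snippet below is a small illustration over a grid of gaps in $(0, 1/4]$; the function names are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence D(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def check_gap(eps: float) -> tuple:
    """Return (KL, 4*eps^2, total variation, Pinsker bound) for Ber(1/2) vs Ber(1/2 - eps)."""
    kl = kl_bernoulli(0.5, 0.5 - eps)
    tv = eps  # total variation distance between two Bernoulli laws is |p - q|
    return kl, 4 * eps ** 2, tv, math.sqrt(kl / 2)


results = [check_gap(k / 1000) for k in range(1, 251)]  # eps ranges over (0, 1/4]
print(results[-1])
```

For small $\epsilon$ the divergence behaves like $2\epsilon^2$, so the constant 4 leaves roughly a factor of two of slack.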

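Averaging the lemma's inequality over $i \in U$, using the concavity of the square-root and
recalling Eq. (7), we obtain

$$\frac{1}{m} \sum_{i \in U} D_{\mathrm{TV}}(Q^0, Q^i)
\;\le\; \sqrt{\frac{2\epsilon^2}{m}\, E_{Q^0}\Big[\sum_{i \in U} N_i\Big]}
\;\le\; \sqrt{\frac{2\epsilon^3}{m}\, T \ln K}
\;=\; \frac{1}{4} , \tag{8}$$

where the final equality follows from our choice of $\epsilon$.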

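We now turn to lower bound the player's expected regret. Since the player incurs (at
least) $\epsilon$ regret each time he picks an action different from $\chi$, his overall regret is lower
bounded by $\epsilon(T - M_\chi)$, whence

$$E[R_T] \;\ge\; \frac{1}{m} \sum_{i \in U} E\big[\epsilon(T - M_i) \mid \chi = i\big] \;=\; \epsilon T - \frac{\epsilon}{m} \sum_{i \in U} E_{Q^i}[M_i] . \tag{9}$$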

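In order to bound the sum on the right-hand side, note that

$$E_{Q^i}[M_i] - E_{Q^0}[M_i] \;=\; \sum_{t=1}^{T} \big(Q^i(I_t = i) - Q^0(I_t = i)\big) \;\le\; T \cdot D_{\mathrm{TV}}(Q^0, Q^i) ,$$

and average over $i \in U$ to obtain

$$\frac{1}{m} \sum_{i \in U} E_{Q^i}[M_i]
\;\le\; \frac{T}{m} \sum_{i \in U} D_{\mathrm{TV}}(Q^0, Q^i) + \frac{1}{m}\, E_{Q^0}\Big[\sum_{i \in U} M_i\Big]
\;\le\; \frac{1}{4} T + \frac{1}{m} T \;\le\; \frac{3}{4} T ,$$
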


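®This KL-divergence equals $\frac{1}{2} \ln \frac{1/2}{1/2 - \epsilon} + \frac{1}{2} \ln \frac{1/2}{1/2 + \epsilon} = \frac{1}{2} \ln\big(1 + \frac{4\epsilon^2}{1 - 4\epsilon^2}\big) \le 4\epsilon^2$, where the last
step is valid for $\epsilon \le \frac{1}{4}$.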
where the last inequality is due to $m \ge 2$. Combining this with Eq. (9) yields $E[R_T] \ge \frac{1}{4}\epsilon T$,
and plugging in our choice of $\epsilon$ gives

$$E[R_T] \;\ge\; \frac{1}{4} \Big(\frac{m}{32 \ln K}\Big)^{1/3} T^{2/3} \;\ge\; \frac{\delta^{1/3}\, T^{2/3}}{50 \ln^{2/3} K} ,$$

which concludes the proof (recall the additional $\frac{1}{3}$-factor stemming from our simplifying
assumption made earlier). □
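
The arithmetic behind the choice of $\epsilon$ can be verified numerically. The sketch below plugs $\epsilon = m^{1/3}(32\,T \ln K)^{-1/3}$ and $m = \delta/(50 \ln K)$ into the right-hand side of Eq. (8) and into the final bound; treating $m$ as a real number and the function name are assumptions made only for this check.

```python
import math


def theorem7_constants(T: int, delta: float, K: int):
    """Evaluate the quantities appearing at the end of the proof for given T, delta, K."""
    log_k = math.log(K)
    m = delta / (50 * log_k)                          # guaranteed size of U (as a real number)
    eps = (m / (32 * T * log_k)) ** (1.0 / 3.0)       # the gap fixed in the proof
    tv_rhs = math.sqrt(2 * eps ** 3 * T * log_k / m)  # right-hand side of Eq. (8)
    regret = 0.25 * eps * T                           # the bound E[R_T] >= eps * T / 4
    target = delta ** (1 / 3) * T ** (2 / 3) / (50 * log_k ** (2 / 3))
    return eps, tv_rhs, regret, target


eps, tv_rhs, regret, target = theorem7_constants(T=10 ** 6, delta=2000.0, K=1000)
print(eps, tv_rhs, regret, target)
```

The ratio between the two regret quantities is $50 / (4 \cdot 1600^{1/3}) \approx 1.07$, independent of $T$, $\delta$ and $K$.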


The claim of the theorem for the case $\delta < 100 \ln K$, that remained unaddressed in the
proof above, follows from a simpler lower bound that applies to weakly observable graphs of
any size.

Theorem 11. If $G = (V, E)$ is weakly observable and $|V| \ge 2$, then for any player algorithm
and for any time horizon $T$ there exists a sequence of loss functions $\ell_1, \ldots, \ell_T : V \to [0,1]$
such that the player’s expected regret is at least 

Proof. First, we observe that any graph over fewer than 3 vertices is either non-observable or
strongly observable; in other words, any weakly observable graph has at least 3 vertices, so
$|V| \ge 3$. Now, if $G$ is weakly observable, then there is a node of $G$, say $i = 1$, without a
self-loop and without an incoming edge from (at least) one of the other nodes of the graph,
say from $j = 2$. Since $|V| \ge 3$ and the graph is observable, $i = 1$ has at least one incoming
edge from a third node of the graph.

Consider the following randomized construction of loss functions $L_1, \ldots, L_T : V \to [0,1]$:
£x e = choose y G {—uniformly at random and for all t and z, let the loss 

$L_t(i) \sim \mathrm{Ber}(\mu_i)$ be a Bernoulli random variable with parameter

$$\mu_i = \begin{cases} \frac{1}{2} - \epsilon\chi & \text{if } i = 1, \\ \frac{1}{2} & \text{if } i = 2, \\ 1 & \text{otherwise.} \end{cases}$$

Here, the “good” actions (whose expected instantaneous regret is at most $\epsilon$) are $i = 1$ and
$i = 2$, and all other actions are “bad” actions (with expected instantaneous regret larger
than $\frac{1}{2}$).

Now, fix a deterministic strategy of the player and let the random variable $N_1$ be the
number of times the player chooses a bad action from $N^{\mathrm{in}}(1)$. Define the conditional probability
functions $Q^1(\cdot) = P(\,\cdot \mid \chi = +1)$ and $Q^2(\cdot) = P(\,\cdot \mid \chi = -1)$, where under $Q^i$ action
$i$ is the optimal action. Also, define $Q^0$ to be the fictitious distribution induced by setting
$\chi = 0$, under which the actions $i = 1$ and $i = 2$ both have an expected loss of $\frac{1}{2}$. Then,
exactly as in the proof of Theorem 7, we can show that

$$D_{\mathrm{TV}}(Q^0, Q^i) \;\le\; \epsilon \sqrt{2\,E_{Q^i}[N_1]} , \qquad i = 1, 2 .$$

Averaging the two inequalities and using the concavity of the square root, we obtain

$$\tfrac{1}{2} D_{\mathrm{TV}}(Q^0, Q^1) + \tfrac{1}{2} D_{\mathrm{TV}}(Q^0, Q^2) \;\le\; \epsilon \sqrt{E_{Q^1}[N_1] + E_{Q^2}[N_1]} \;=\; \epsilon \sqrt{2\,E[N_1]} , \tag{10}$$

where we have used the fact that $P(\cdot) = \frac{1}{2} Q^1(\cdot) + \frac{1}{2} Q^2(\cdot)$.

We can now analyze the player’s expected regret, again denoted by Notice that if 

E[iVi] > we have E[i?T] > E[|W] > (since each action that reveals the 

loss of i = 1 is a bad action whose instantaneous regret is at least |), which gives the required 
lower bound. Hence, we may assume that E[iVi] < which case the right-hand side 

of Eq. (10) is bounded by This yields an analogue of Eq. (8), from which we can proceed 
exactly as in the proof of Theorem 7 to obtain that E[i?T] > Using our choice of e gives 
the theorem. □ 

B.3 Separation Between the Informed and Uninformed Models 

Finally, we prove our separation result for weakly observable time-varying graphs, which
shows that the uninformed model is harder than the informed model (in terms of the dependence
on the feedback structure) for weakly observable feedback graphs.

Theorem 9 (restated). For any randomized player strategy in the uninformed feedback
model, there exists a sequence of weakly observable graphs $G_1, \ldots, G_T$ over a set $V$ of $K \ge 4$
actions with $\delta(G_t) = \alpha(G_t) = 1$ for all $t$, and a sequence of loss functions $\ell_1, \ldots, \ell_T : V \to$
[0,1], such that the player’s expected regret is at least 

Proof. As before, it is enough to demonstrate a randomized construction of weakly observable
graphs $G_1, \ldots, G_T$ and loss functions $L_1, \ldots, L_T$ such that the expected regret of any
deterministic algorithm is 

The random loss functions Li,..., Lt are constructed almost identically to those used 
in the proof of Theorem 11; the only change is in the value of e, which is now fixed to 
e = j{K/Ty^^. In order to construct the random sequence of weakly observable graphs 
$G_1, \ldots, G_T$, first pick nodes $J_1, \ldots, J_T$ independently and uniformly at random from $V' =
\{3, \ldots, K\}$. Then, for each $t$, form the graph $G_t$ by taking the complete graph over $V$ (that
includes all directed edges and self-loops) and removing all edges incoming to node $i = 1$
(including its self-loop), except for the edge incoming from $J_t$. In other words, the only way
to observe the loss $L_t(1)$ of node 1 on round $t$ is by picking the action $J_t$ on that round.
Notice that $G_t$ is weakly observable, as each of its nodes has at least one incoming edge, but
there is a node (node 1) which is not strongly observable. Also, we have $\delta(G_t) = 1$ since $J_t$
dominates the entire graph, and $\alpha(G_t) = 1$ as any pair of vertices is connected by at least
one directed edge.

We now turn to analyze the expected regret of any player on our construction; the analysis 
is very similar to that of Theorem 11, and we only describe the required modifications. Fix 
any deterministic algorithm, and define the random variables A,..., U and Ni exactly as 
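in the proof of Theorem 11. In addition, define the distributions $Q^1$, $Q^2$ and $Q^0$ as in that
proof, for which we proved (recall Eq. (10)) that

$$\tfrac{1}{2} D_{\mathrm{TV}}(Q^0, Q^1) + \tfrac{1}{2} D_{\mathrm{TV}}(Q^0, Q^2) \;\le\; \epsilon \sqrt{2\,E[N_1]} . \tag{11}$$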

Now, define another random variable N to be the number of times the player picked an 
action from V throughout the game. Notice that in case E[W] > we have 
E[-Rt] > IE[|iV] > which implies the stated lower-bound on the expected regret. 

Hence, in what follows we assume that K[N] < Notice that for the graphs we 

constructed, Q{It = Jt) < ^Q{h ^ V) since Jt is picked uniformly at random from V 
(and independently from It because in the uninformed model Gt is not known when It is 
drawn) and since K > A. Summing this over f = 1,..., T, we obtain that E[iVi] < -^E[A^] < 
and with our choice of e this shows that the right-hand size of Eq. (11) is upper 
bounded by Again, continuing exactly as in the proof of Theorem 7, we hnally get that 
E[i?T] > and with our choice of e this concludes the proof. □ 


C Tight Bounds for the Loopless Clique 

We restate and prove Theorem 3. 

Theorem 3 (restated). For any sequence of loss functions $\ell_1, \ldots, \ell_T$, where $\ell_t : V \to [0,1]$,
the expected regret of Algorithm 1, with the loopless clique feedback graph and with parameters
$\eta = \sqrt{(\ln K)/(2T)}$ and $\gamma = 2\eta$, is upper-bounded by $5\sqrt{T \ln K}$.

Proof. Since $G$ is strongly observable, the exploration distribution $u$ is uniform on $V$. Fix
any $i^* \in V$. Notice that for any $i \in V$ we have $j \in N^{\mathrm{in}}(i)$ for all $j \ne i$, and so $P_t(i) =
1 - p_t(i)$. On the other hand, by the definition of $p_t$ and since $\gamma = 2\eta$ and $K \ge 2$, we have
$p_t(i) = (1 - \gamma) q_t(i) + \frac{\gamma}{K} \le 1 - \frac{\gamma}{2} = 1 - \eta$, so that $P_t(i) \ge \eta$. Thus, we can apply
Lemma 4 with $S_t = V$ to the vectors $\hat\ell_1, \ldots, \hat\ell_T$ and take expectations.


E 


K 


5 ; 5;«,we,i«;(*)|-e,i«;(**)] 


J =1 \i=l 

Recalling Eq. (3) and Pt{i) = 1 —Pt{i), we get 
Finally, for the distributions $p_t$ and $q_t$ generated by the algorithm we note that

$$1 - p_t(i) \;\ge\; \Big(1 - \frac{1}{K}\Big)\big(1 - q_t(i)\big) \;\ge\; \frac{1}{2}\big(1 - q_t(i)\big) ,$$

where the last inequality holds since $K \ge 2$. Hence,

$$\sum_{t=1}^{T} \sum_{i \in V} \frac{q_t(i)\big(1 - q_t(i)\big)}{1 - p_t(i)} \;\le\; 2T .$$

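Combining this with Eq. (5) gives

$$E\Bigg[\sum_{t=1}^{T} \sum_{i \in V} p_t(i)\,\ell_t(i)\Bigg] - \sum_{t=1}^{T} \ell_t(i^*) \;\le\; \gamma T + \frac{\ln K}{\eta} + 2\eta T \;=\; \frac{\ln K}{\eta} + 4\eta T ,$$

where we substituted our choice $\gamma = 2\eta$. Picking $\eta = \sqrt{(\ln K)/(2T)}$ proves the theorem. □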
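
The final tuning step is a one-line calculation, which the following sketch reproduces numerically. It evaluates $\frac{\ln K}{\eta} + 4\eta T$ at $\eta = \sqrt{(\ln K)/(2T)}$ and compares it with $5\sqrt{T \ln K}$; the function name is a hypothetical helper.

```python
import math


def loopless_clique_bound(T: int, K: int) -> float:
    """Value of ln(K)/eta + 4*eta*T at the learning rate eta = sqrt(ln(K) / (2T))."""
    eta = math.sqrt(math.log(K) / (2 * T))
    return math.log(K) / eta + 4 * eta * T


T, K = 10_000, 10
bound = loopless_clique_bound(T, K)
reference = math.sqrt(T * math.log(K))
print(bound / reference)
```

The ratio equals $3\sqrt{2} \approx 4.24$ for every $T$ and $K$, which is below the constant 5 stated in the theorem.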
D Connections to Partial Monitoring

In online learning with partial monitoring the player is given a loss matrix $L$ over $[0,1]$ and a
feedback matrix $H$ over a finite alphabet $\Sigma$. The matrices $L$ and $H$ are both of size $K \times M$,
where $K$ is the number of player's actions and $M$ is the number of environment's actions.
The environment preliminarily fixes a sequence $y_1, y_2, \ldots$ of actions (i.e., matrix column
indices) hidden from the player.® At each round $t = 1, 2, \ldots$, the loss of the player
choosing action $I_t$ (i.e., a matrix row index) is given by the matrix entry $L(I_t, y_t) \in [0,1]$.
The only feedback that the player observes is the symbol $H(I_t, y_t) \in \Sigma$; in particular, the
column index $y_t$ and the loss value $L(I_t, y_t)$ both remain unknown. The player's goal is to
control a notion of regret analogous to ours, where the minimization over $V$ is replaced by
a minimization over the set of row indices, corresponding to the player's $K$ actions.

We now introduce a reduction from our online setting to partial monitoring for the
special case of $\{0,1\}$-valued loss functions (note that our lower bounds still hold under this
restriction, and so does our characterization of Theorem 1). Given a feedback graph $G$, we
create a partial monitoring game in which the environment has a distinct action for each
binary assignment of losses to vertices in $V$. Hence, $L$ and $H$ have $K$ rows and $M = 2^K$
columns, where the union of columns in $L$ is the set $\{0,1\}^K$. The entries of $H$ encode $G$
using any alphabet $\Sigma$ such that, for any row $i \in V$ and for any two columns $y \ne y'$,

$$H(i, y) = H(i, y') \iff \big\{(k, L(k, y)) : k \in N^{\mathrm{out}}(i)\big\} = \big\{(k, L(k, y')) : k \in N^{\mathrm{out}}(i)\big\} . \tag{12}$$

Note that this is a bona fide reduction: given a partial monitoring algorithm $A$, we can
define an algorithm $A'$ for solving any online learning problem with known feedback graph
$G = (V, E)$ and $\{0,1\}$-valued loss functions. The algorithm $A'$ pre-computes a mapping from
$\{(k, L(k, y)) : k \in N^{\mathrm{out}}(i)\}$ for each $i \in V$ and for each $y = 1, \ldots, M$ to the alphabet $\Sigma$ such
that Eq. (12) is satisfied. Then, at each round $t$, $A'$ asks $A$ to draw a row (i.e., a vertex of $V$)
$I_t$ and obtains the feedback $\{(k, L(k, y_t)) : k \in N^{\mathrm{out}}(I_t)\}$ from the environment. Finally,
$A'$ uses the pre-computed mapping to obtain the symbol $\sigma_t \in \Sigma$ which is fed to $A$.

The minimax regret of partial monitoring games is determined by a set of observability
conditions on the pair $(L, H)$. These conditions are expressed in terms of a canonical
representation of $H$ as the set of matrices $S_i$ for $i \in V$. $S_i$ has a row for each distinct symbol
$\sigma \in \Sigma$ in the $i$-th row of $H$, and $S_i(\sigma, y) = \mathbb{1}\{H(i, y) = \sigma\}$ for $y = 1, \ldots, M$. When cast to
the class of pairs $(L, H)$ obtained from feedback graphs $G$ through the above encoding, the
partial monitoring observability conditions of Bartók et al. (2014, Definitions 5 and 6) can
be expressed as follows. Let $L(i, \cdot)$ be the column vector denoting the $i$-th row of $L$. Let also
rowsp be the rowspace of a matrix and $\oplus$ be the cartesian product between linear spaces.
Then

® The standard definition of partial monitoring (see, e.g., Cesa-Bianchi and Lugosi, 2006, Section 6.4)
assumes a harder adaptive environment, where each action $y_t$ is allowed to depend on all of past player's
actions $I_1, \ldots, I_{t-1}$. However, the partial monitoring lower bounds of Antos et al. (2013, Theorem 13) and
Bartók et al. (2014, Theorem 3) hold for our weaker notion of environment as well.
• $(L, H)$ is globally observable if for all pairs $i, j \in V$ of actions,

$$L(i, \cdot) - L(j, \cdot) \in \bigoplus_{k = 1, \ldots, K} \mathrm{rowsp}(S_k) .$$

• $(L, H)$ is locally observable if for all pairs $i, j \in V$ of actions,

$$L(i, \cdot) - L(j, \cdot) \in \mathrm{rowsp}(S_i) \oplus \mathrm{rowsp}(S_j) .$$

The characterization result for partial monitoring of Bartók et al. (2014, Theorem 2) states
that the minimax regret is of order $\sqrt{T}$ for locally observable games and of order $T^{2/3}$
for globally observable games. We now prove that the above encoding of feedback graphs
$G$ as instances $(L, H)$ of partial monitoring games preserves the observability conditions.
Namely, our encoding maps weakly (resp., strongly) observable graphs $G$ to globally (resp.,
locally) observable instances of partial monitoring. Combining this with our characterization
result (Theorem 1) and the partial monitoring characterization result (Bartók et al., 2014,
Theorem 2), we conclude that the minimax rates are preserved by our reduction.

Claim 12. If $j \in N^{\mathrm{out}}(i)$ then there exists a subset $S_0$ of rows of $S_i$ such that

$$L(j, \cdot) = \sum_{\sigma \in S_0} S_i(\sigma, \cdot) .$$

Proof. Let $S_0$ be the union of rows $S_i(\sigma_y, \cdot)$ such that $H(i, y) = \sigma_y$ and $L(j, y) = 1$ for
some $y$. Each such row has a 1 in position $y$ because $S_i(\sigma_y, y) = 1$ holds by definition.
Moreover, no such row has a 1 in a position $y'$ where $L(j, y') = 0$. Indeed, combining
$j \in N^{\mathrm{out}}(i)$ with Eq. (12), we get that $L(j, y') = 0$ implies $H(i, y') \ne \sigma_y$, which in turn
implies $S_i(\sigma_y, y') = 0$. □
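
The encoding and the claim above can be checked on a toy graph. The sketch below is an illustration under stated assumptions: vertices are labelled $0, \ldots, K-1$, the graph is a dictionary of out-neighborhoods, and the symbol $H(i, y)$ is taken to be the tuple of observed (vertex, loss) pairs, which satisfies Eq. (12) by construction. The helper names are hypothetical.

```python
from itertools import product


def encode(out_nbrs: dict):
    """Build L and H; column y enumerates every assignment of {0,1} losses to the K vertices."""
    K = len(out_nbrs)
    columns = list(product([0, 1], repeat=K))
    L = [[col[i] for col in columns] for i in range(K)]
    # The symbol for (i, y) is the list of (vertex, loss) pairs that playing i reveals.
    H = [[tuple((k, col[k]) for k in sorted(out_nbrs[i])) for col in columns] for i in range(K)]
    return L, H


def signal_matrix(H_row) -> dict:
    """S_i: one 0/1 indicator row per distinct symbol appearing in row i of H."""
    return {s: [1 if h == s else 0 for h in H_row] for s in sorted(set(H_row))}


def claim12_holds(L, H, i: int, j: int) -> bool:
    """Check that L(j, .) is the sum of the rows of S_i whose symbol shows loss 1 for j.
    Only meaningful when j is an out-neighbor of i."""
    S_i = signal_matrix(H[i])
    chosen = [row for s, row in S_i.items() if dict(s)[j] == 1]
    total = [sum(col) for col in zip(*chosen)] if chosen else [0] * len(L[j])
    return total == L[j]


graph = {0: {1, 2}, 1: {0, 1}, 2: {2}}
L, H = encode(graph)
checks = {(i, j): claim12_holds(L, H, i, j) for i in graph for j in graph[i]}
print(checks)
```

Because distinct symbols have disjoint indicator rows, the selected rows add up to a 0/1 vector, which is why the sum recovers $L(j, \cdot)$ exactly.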

Theorem 13. Any feedback graph $G$ can be encoded as a partial monitoring problem $(L, H)$
such that the observability conditions are preserved.

Proof. If $G$ is weakly observable, then for every $j \in V$ there is some $i \in V$ such that
$j \in N^{\mathrm{out}}(i)$. By Claim 12, $L(j, \cdot) \in \mathrm{rowsp}(S_i)$ and the global observability condition follows.
If $G$ is strongly observable, then for any distinct $i, j \in V$ the subgraph $G'$ of $G$ restricted
to the pair of vertices $i, j$ is weakly observable. By the previous argument, this implies that
$L(i, \cdot) - L(j, \cdot) \in \mathrm{rowsp}(S_i) \oplus \mathrm{rowsp}(S_j)$ and the proof is concluded. □